Our analysis of 146 papers reveals significant gaps in how bias measures are reported in NLP. We introduce "actionability" as a framework and provide concrete recommendations for creating more impactful bias measures.
Surveying Bias Measures
When we measure bias in NLP systems, we want these measurements to lead to meaningful actions, whether that's improving models, adjusting deployment strategies, or informing policy decisions. However, when surveying 146 papers that propose bias measures, we found that many measures lack crucial elements that would make them actionable:
- 47% don't specify their intended use
- 25% lack a clear definition of what bias they measure
- Only 28 out of 146 papers (about 19%) discuss reliability
- Many measures lack clear connections to real-world impacts
What Makes a Measure Actionable?
We introduce actionability: the degree to which a measure's results enable informed action. For a bias measure to be actionable, it needs:
- Clear motivation and intended use
- Well-defined theoretical bias construct
- Explicit interval (the range of possible values) and ideal result
- Specified conditions for meaningful results
- Reliability assessment
These elements work together to ensure that measuring bias leads to meaningful change. Without clear motivation, we can't assess if a measure fits our needs. Without a well-defined construct, we risk measuring the wrong thing. Without explicit intervals and reliability assessments, we can't understand our results. And without understanding the conditions for meaningful results, we might misapply measures in contexts where they don't work. Our analysis shows that current papers often miss several of these crucial elements, limiting their practical impact.
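As a rough illustration, the five elements above can be treated as a checklist to fill in when reporting a measure. The sketch below is not from our paper: the class name `BiasMeasureReport`, its field names, and the example values are all hypothetical, chosen only to show how missing elements could be flagged.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BiasMeasureReport:
    """Hypothetical record of the five actionability elements for one measure."""
    motivation_and_use: Optional[str] = None
    bias_construct: Optional[str] = None
    interval_and_ideal: Optional[str] = None
    conditions_for_use: Optional[str] = None
    reliability: Optional[str] = None

    def missing_elements(self) -> List[str]:
        """Return the names of elements that are absent or blank."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]


# Assumed example: only two of the five elements have been documented.
report = BiasMeasureReport(
    motivation_and_use="Audit a sentiment classifier before deployment",
    bias_construct="Disparity in predicted sentiment across gendered names",
)
print(report.missing_elements())
```

A checklist like this only records whether an element was reported, not whether it was reported well; judging quality still requires reading the description itself.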
Our Recommendations
Be Clear About Motivation & Use
- Explicitly state why a new measure is needed
- Specify which issues it addresses
- Define scope of applicable settings
Select & Report Theoretical Construct
- Define underlying bias construct
- Explain how behaviors are reflected in measurements
- Avoid conflating the construct with its operationalization
Relate Measures to Consequences
- Ground values in expected behaviors
- Explain social ramifications
- Define ideal scores and extrema
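To make the last point concrete, here is a minimal sketch of what stating a measure's extrema and ideal score could look like in code. The class `ScoreScale`, its fields, and the example range of [-1, 1] with an ideal of 0 are assumptions for illustration, not a scale taken from any specific measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreScale:
    """Hypothetical description of the values a bias measure can take."""
    minimum: float
    maximum: float
    ideal: float

    def __post_init__(self) -> None:
        # A usable scale needs a non-empty range that contains the ideal.
        if not self.minimum < self.maximum:
            raise ValueError("minimum must be smaller than maximum")
        if not self.minimum <= self.ideal <= self.maximum:
            raise ValueError("ideal must lie within [minimum, maximum]")

    def distance_from_ideal(self, score: float) -> float:
        """Distance from the ideal, as a fraction of the largest possible distance."""
        if not self.minimum <= score <= self.maximum:
            raise ValueError(f"score {score} is outside the stated interval")
        worst = max(self.ideal - self.minimum, self.maximum - self.ideal)
        return abs(score - self.ideal) / worst


# Assumed example: scores in [-1, 1], where 0 means no measured disparity.
scale = ScoreScale(minimum=-1.0, maximum=1.0, ideal=0.0)
print(scale.distance_from_ideal(0.25))
```

Knowing how far a score sits from the ideal is still not the same as knowing what it means for people affected by the system, which is why the other two points in this list matter.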
Always Assess Reliability
- Test for consistency across different contexts
- Report error margins and confidence intervals
- Document sources of uncertainty and variation
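One common way to report an error margin is a percentile bootstrap over the individual scores that make up a measure. The sketch below assumes the measure is a mean of per-prompt scores; the function name `bootstrap_ci`, its parameters, and the example scores are hypothetical, and other reliability analyses may suit a given measure better.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample with replacement and record the mean of each resample.
    means: List[float] = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(scores), lower, upper


# Assumed example: bias scores from eight prompt templates (made-up values).
per_prompt_scores = [0.12, 0.08, 0.15, 0.05, 0.11, 0.09, 0.14, 0.07]
mean, low, high = bootstrap_ci(per_prompt_scores)
print(f"bias score = {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

An interval like this only captures variation from sampling the scores. Variation from other sources, such as prompt wording or random seeds, needs to be tested and documented separately.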
Consider Target Audience
- Account for different stakeholder needs
- Consider varying action capabilities
- Enable appropriate interventions for each group
Want to create more impactful bias measures? Read our EMNLP paper for detailed recommendations and examples.