In our EMNLP paper, we investigate the often overlooked concept of "actionability" in NLP bias measures: how useful these metrics are for taking concrete steps to address bias. By analyzing 146 papers on bias measurement in NLP, we found that crucial elements such as intended use cases and reliability assessments are frequently missing or unclear, leaving a gap between measuring bias and actually addressing it. Based on these findings, we propose guidelines for developing and documenting bias metrics that can more effectively drive real-world improvements in NLP systems.

Our analysis of 146 papers reveals significant gaps in how bias measures are reported in NLP. We introduce "actionability" as a framework and provide concrete recommendations for creating more impactful bias measures.

Surveying Bias Measures

When we measure bias in NLP systems, we want these measurements to lead to meaningful actions, whether that's improving models, adjusting deployment strategies, or informing policy decisions. However, when surveying 146 papers that propose bias measures, we found that many measures lack crucial elements that would make them actionable:

47% don't specify their intended use
25% lack a clear definition of what bias they measure
Only 28 out of 146 papers discuss reliability
Many measures lack clear connections to real-world impacts

What Makes a Measure Actionable?

We introduce actionability: the degree to which a measure's results enable informed action. For a bias measure to be actionable, it needs:

Clear motivation and intended use
Well-defined theoretical bias construct
Explicit interval and ideal results
Specified conditions for meaningful results
Reliability assessment

These elements work together to ensure that measuring bias leads to meaningful change. Without clear motivation, we can't assess if a measure fits our needs. Without a well-defined construct, we risk measuring the wrong thing. Without explicit intervals and reliability assessments, we can't understand our results. And without understanding the conditions for meaningful results, we might misapply measures in contexts where they don't work. Our analysis shows that current papers often miss several of these crucial elements, limiting their practical impact.
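One way to make these elements concrete is to record them alongside a measure and flag whatever is left undocumented. The Python sketch below is purely illustrative: `BiasMeasureCard`, its field names, and `missing_elements` are hypothetical names chosen for this example, not an artifact from the paper, and the example values are invented. The "interval and ideal results" element is split into two fields for convenience.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class BiasMeasureCard:
    """Hypothetical record of the actionability elements for one bias measure."""

    motivation_and_use: Optional[str] = None       # why the measure exists, who should use it
    bias_construct: Optional[str] = None           # the theoretical bias being measured
    interval: Optional[Tuple[float, float]] = None  # (lowest, highest) possible result
    ideal_result: Optional[float] = None           # the result that indicates "no bias"
    conditions: Optional[str] = None               # when results are meaningful
    reliability: Optional[str] = None              # how consistency was assessed


def missing_elements(card: BiasMeasureCard) -> List[str]:
    """Return the names of elements that have not been documented."""
    return [f.name for f in fields(card) if getattr(card, f.name) is None]


def ideal_is_in_interval(card: BiasMeasureCard) -> bool:
    """Check that the documented ideal result lies inside the documented interval."""
    if card.interval is None or card.ideal_result is None:
        return False
    low, high = card.interval
    return low <= card.ideal_result <= high


# Invented example values, for illustration only.
example_card = BiasMeasureCard(
    motivation_and_use="Compare model scores across demographic groups before deployment",
    bias_construct="Disparity in system performance between social groups",
    interval=(0.0, 1.0),
    ideal_result=0.0,
)

print("Undocumented elements:", missing_elements(example_card))
print("Ideal result inside interval:", ideal_is_in_interval(example_card))
```

A checklist like this cannot judge whether the documentation is good, only whether it exists, so it is best read as a prompt for authors rather than a quality check.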

Our Recommendations

Be Clear About Motivation & Use

Explicitly state why a new measure is needed
Specify which issues it addresses
Define scope of applicable settings

Select & Report Theoretical Construct

Define underlying bias construct
Explain how behaviors are reflected in measurements
Prevent construct-operationalization conflation

Relate Measures to Consequences

Ground values in expected behaviors
Explain social ramifications
Define ideal scores and extrema

Always Assess Reliability

Test for consistency across different contexts
Report error margins and confidence intervals
Document sources of uncertainty and variation
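As one possible way to report error margins, a percentile bootstrap can attach a confidence interval to an aggregate bias score. The sketch below uses only the Python standard library and rests on assumptions of ours: that the measure produces one score per test instance, that the aggregate is a mean, and that instances can be resampled independently. The function name `bootstrap_ci` and the example scores are hypothetical, and the paper does not prescribe this particular procedure.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = statistics.mean,
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) via a percentile bootstrap."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the reported interval is reproducible
    n = len(scores)
    # Recompute the statistic on resamples drawn with replacement from the scores.
    estimates = sorted(
        statistic([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistic(scores), lower, upper


# Invented per-instance bias scores, for illustration only.
per_instance_scores = [0.12, 0.08, 0.15, 0.10, 0.11, 0.09, 0.14, 0.13]
point, low, high = bootstrap_ci(per_instance_scores)
print(f"bias score = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Running the same procedure across different templates, seeds, or datasets and comparing the resulting intervals is one simple way to probe consistency across contexts.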

Consider Target Audience

Account for different stakeholder needs
Consider varying action capabilities
Enable appropriate interventions for each group

Different stakeholders require different types of actionability

Want to create more impactful bias measures? Read our EMNLP paper for detailed recommendations and examples.

Paper: Metrics for What, Metrics for Whom? Assessing Actionability of Bias Evaluation Metrics in NLP