Posted on 2024-11-12

Metrics for What, Metrics for Whom?

Assessing Actionability of Bias Evaluation Metrics in NLP



In our EMNLP paper, we investigate the critical but often overlooked concept of "actionability" in NLP bias measures: essentially, how useful these metrics are for taking concrete steps to address bias. Analyzing 146 papers on bias measurement in NLP, we found that crucial elements such as intended use cases and reliability assessments are frequently missing or unclear, creating a significant gap between measuring bias and actually addressing it. Based on these findings, we propose guidelines for developing and documenting bias metrics that can more effectively drive real-world improvements in NLP systems.


Our analysis of 146 papers reveals significant gaps in how bias measures are reported in NLP. We introduce "actionability" as a framework and provide concrete recommendations for creating more impactful bias measures.

Surveying Bias Measures

When we measure bias in NLP systems, we want these measurements to lead to meaningful action, whether that's improving models, adjusting deployment strategies, or informing policy decisions. However, in surveying 146 papers that propose bias measures, we found that many lack crucial elements that would make them actionable:

  • 47% don't specify their intended use
  • 25% lack a clear definition of what bias they measure
  • Only 28 out of 146 papers discuss reliability
  • Many measures lack clear connections to real-world impacts

What Makes a Measure Actionable?

We introduce actionability: the degree to which a measure's results enable informed action. For a bias measure to be actionable, it needs:

  • Clear motivation and intended use
  • Well-defined theoretical bias construct
  • Explicit score interval and ideal result
  • Specified conditions for meaningful results
  • Reliability assessment

These elements work together to ensure that measuring bias leads to meaningful change. Without a clear motivation, we can't assess whether a measure fits our needs. Without a well-defined construct, we risk measuring the wrong thing. Without explicit intervals and reliability assessments, we can't interpret our results. And without understanding the conditions for meaningful results, we might misapply measures in contexts where they don't work. Our analysis shows that current papers often miss several of these crucial elements, limiting their practical impact.
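
To make these elements concrete, here is a minimal sketch of how they could be recorded alongside a measure's implementation. This is our own illustration, not an artifact from the paper; the class and field names are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class BiasMeasureCard:
        """Documentation for a bias measure, mirroring the actionability elements above."""
        name: str
        motivation: str                 # why the measure exists and its intended use
        construct: str                  # the theoretical bias construct being measured
        interval: tuple[float, float]   # attainable score range
        ideal_value: float              # score indicating no bias under the construct
        valid_conditions: str           # when results are meaningful
        reliability: str                # how consistency and uncertainty were assessed

    card = BiasMeasureCard(
        name="occupation-gender association gap",
        motivation="Compare candidate models before deployment in a screening pipeline.",
        construct="Differential association of occupation terms with gendered terms.",
        interval=(-1.0, 1.0),
        ideal_value=0.0,
        valid_conditions="English models evaluated with templated occupation prompts.",
        reliability="Bootstrap confidence intervals over prompt templates.",
    )

Forcing each field to be filled in makes missing elements, such as an unstated intended use, immediately visible.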

Our Recommendations

Be Clear About Motivation & Use

  • Explicitly state why a new measure is needed
  • Specify which issues it addresses
  • Define the scope of applicable settings

Select & Report Theoretical Construct

  • Define underlying bias construct
  • Explain how model behaviors are reflected in measurements
  • Avoid conflating the construct with its operationalization

Relate Measures to Consequences

  • Ground values in expected behaviors
  • Explain social ramifications
  • Define ideal scores and extrema (see the sketch below)
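
As a concrete illustration (our own sketch, not a metric from the paper), here is a simple gap measure whose interval, ideal score, and extrema are stated up front:

    def positive_rate_gap(preds_a: list[int], preds_b: list[int]) -> float:
        """Difference in positive-prediction rates between groups A and B.

        Interval: [-1.0, 1.0].
        Ideal score: 0.0 (both groups receive positive predictions at equal rates).
        Extrema: +1.0 means only group A receives positive predictions;
        -1.0 means only group B does.
        """
        rate_a = sum(preds_a) / len(preds_a)
        rate_b = sum(preds_b) / len(preds_b)
        return rate_a - rate_b

    print(positive_rate_gap([1, 1, 0, 1], [0, 1, 0, 0]))  # 0.5: group A is favored

With the extrema documented, a reader can tell that 0.5 is not merely "nonzero" but halfway to the worst case in one direction.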

Always Assess Reliability

  • Test for consistency across different contexts
  • Report error margins and confidence intervals (see the bootstrap sketch below)
  • Document sources of uncertainty and variation
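
A lightweight way to report error margins, sketched below, is a percentile bootstrap over per-example (or per-template) scores. We assume the measure produces one scalar per example; the function is our illustration, not the paper's method.

    import random

    def bootstrap_ci(scores: list[float], n_resamples: int = 1000,
                     alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
        """Percentile bootstrap confidence interval for the mean of per-example scores."""
        rng = random.Random(seed)
        means = []
        for _ in range(n_resamples):
            resample = [rng.choice(scores) for _ in scores]
            means.append(sum(resample) / len(resample))
        means.sort()
        lo = means[int(n_resamples * alpha / 2)]
        hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
        return lo, hi

    # Example: per-template gap scores; report the mean together with its interval.
    scores = [0.12, 0.08, 0.15, 0.02, 0.11, 0.09, 0.14, 0.05]
    print(sum(scores) / len(scores), bootstrap_ci(scores))

A wide interval across templates or contexts is itself a finding: it signals that a single point estimate should not drive decisions.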

Consider Target Audience

  • Account for different stakeholder needs
  • Consider varying action capabilities
  • Enable appropriate interventions for each group

Different stakeholders require different types of actionability.

Want to create more impactful bias measures? Read our EMNLP paper for detailed recommendations and examples.

Linked publications

Metrics for What, Metrics for Whom: Assessing Actionability of Bias Evaluation Metrics in NLP. Pieter Delobelle, Giuseppe Attanasio, Debora Nozza, Su Lin Blodgett, Zeerak Talat. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024).