Pieter Delobelle | Towards Fairer LLMs

Abstract

Bias in language models, traced from the Bloomberg ChatGPT‑as‑recruiter case through allocational and representational harms — and an honest look at where the research has and hasn't actually moved.

A two‑part story. Part 1: how bias gets locked in by automated decision systems (SyRI, COMPAS, the Polish public employment service), and why “human‑in‑the‑loop” rarely rescues it. Part 2: what we can actually measure and mitigate, from RobBERT‑era intrinsic metrics through the multilingual SHADES benchmark, to inference‑time interventions like AurA.

Outline

What the talk covers

01 The recruiter problem. Bloomberg found that GPT ranked applicants with Black‑female‑associated names as the top candidate for software engineering roles 36% less often than the best‑performing group. Why this is structural, not anecdotal.
02 A harms taxonomy. Representational vs. allocational harms, and why recourse becomes nearly impossible once decisions are automated (SyRI in the Netherlands, COMPAS in the US, and the Polish job‑centre case, where only 0.58% of profiles were ever overruled).
03 Allocational harms. Fairness as error‑rate parity (separation), the unavoidable trade‑off in any binary classifier, and ProbLog4Fairness as a way to model bias mechanisms directly.
04 Representational harms. Gender bias in BERT (the “actrice”/“huisvrouw” case in RobBERT; the Dutch words mean “actress” and “housewife”), ResumeTailor on nationality‑by‑profession in Dutch CVs, and the multilingual SHADES stereotype benchmark.
05 Metrics actionability. Why most intrinsic bias metrics don't correlate with each other or with downstream task bias — and what makes a metric worth using anyway.
06 Inference‑time control. DExperts, expert neurons (Suau et al.), CtrlG, and AurA / Whispering Experts — suppressing toxicity‑related experts at generation time. Where SAEs might fit next.
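
Item 03 frames fairness as error‑rate parity (separation): a binary classifier should have similar true‑positive and false‑positive rates in every group. As a rough illustration, not taken from the talk, the sketch below computes per‑group rates and the largest gap between groups. The function names `error_rates_by_group` and `separation_gaps`, and the toy data, are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def error_rates_by_group(y_true, y_pred, groups):
    """Return {group: (tpr, fpr)} for binary labels and predictions (0 or 1)."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            counts[g]["tp" if p == 1 else "fn"] += 1
        else:
            counts[g]["fp" if p == 1 else "tn"] += 1

    rates = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        # A rate is undefined (NaN) if a group has no examples of that class.
        tpr = c["tp"] / positives if positives else float("nan")
        fpr = c["fp"] / negatives if negatives else float("nan")
        rates[g] = (tpr, fpr)
    return rates


def separation_gaps(rates):
    """Largest between-group difference in TPR and in FPR."""
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Toy data (hypothetical): four applicants in each of two groups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = error_rates_by_group(y_true, y_pred, groups)
tpr_gap, fpr_gap = separation_gaps(rates)
print(rates)             # per-group (TPR, FPR)
print(tpr_gap, fpr_gap)  # both gaps are zero only under exact separation
```

In this toy data group "a" is classified perfectly while group "b" has a true‑positive rate and a false‑positive rate of 0.5, so both gaps are 0.5. Separation asks for both gaps to be zero, or close to it in practice.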

The deck