Thanks to transformer-based foundation models, natural language processing has lately been at the forefront of many AI-related innovations. These models learn the distribution of vast amounts of training data exceptionally well and can be reused for a wide range of tasks, which is why they are considered foundational. However, foundation models also reproduce undesirable traits from their training data, such as generating hate speech or reinforcing gender stereotypes.
In this thesis, we investigate these undesirable traits in language models along three axes. First, we focus on measurement and highlight the lack of reliable metrics for societal biases, which hinders progress. We provide a literature survey and show that existing metrics are not consistent with one another. Based on these findings, this thesis offers a way forward.
Second, we move to mitigation methods and contribute a novel method to reduce certain undesirable traits, as we demonstrate by reducing gender stereotypes in English and Dutch foundation models. Reducing these stereotypes—called mitigation in the literature—is not trivial, however, and we additionally demonstrate on a well-studied classification task that adversarial methods can yield unexpected results.
Third, we focus our attention on the Dutch language. Not only are most resources released for English, but the assumptions behind many metrics are also rooted in English grammar and Anglo-Saxon culture. We address both issues by (i) releasing a series of Dutch foundation models, called RobBERT, for which we also investigate cheaper and more energy-efficient training regimes, and (ii) investigating gender bias in RobBERT.
With these contributions, this thesis bundles research on measuring and reducing a range of undesirable traits in foundation language models, paving the way to fairer models for English and other languages.