Posted on 2022-11-15

RobBERT-2022

Updating a Dutch Language Model to Account for Evolving Language Use



We update the RobBERT Dutch language model to include new high-frequency tokens present in the latest Dutch OSCAR corpus from 2022. We then pre-train the RobBERT model on this dataset. Our new model is a plug-in replacement for RobBERT and yields a significant performance increase on certain language tasks.


RobBERT-2022 is the newest release of the Dutch RobBERT model. Since the original release in January 2020, much has happened and our language has evolved. For instance, the COVID-19 pandemic introduced a wide range of new words that were suddenly in daily use. To account for this and other changes in language use, we release a new Dutch BERT model trained on data from 2022.

Thanks to this more recent dataset, the model shows increased performance on several tasks related to recent events, e.g. COVID-19-related tasks. We also found that for tasks whose data contains no information more recent than 2019, the original RobBERT model can still outperform the newer one.

Creating RobBERT-2022

We detailed the development of RobBERT-2022 in two blog posts: part one and part two. The gist is that we added almost 3k new high-frequency tokens to the vocabulary and then continued pre-training the model from the old weights; a rough sketch of the vocabulary extension step follows below.
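To illustrate that step, the sketch below uses the HuggingFace Transformers API to add new tokens to the original RobBERT tokenizer and resize the embedding matrix before continuing pre-training. The example tokens are illustrative stand-ins, and the snippet simplifies the actual procedure rather than reproducing our full training pipeline.

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Start from the original RobBERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")

# Illustrative stand-ins for the ~3k high-frequency tokens from the 2022 OSCAR corpus
new_tokens = ["coronamaatregelen", "mondkapje"]
tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids get (randomly initialised) rows,
# then continue masked-language-model pre-training from the old weights
model.resize_token_embeddings(len(tokenizer))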

We evaluated RobBERT-2022 on the same benchmark tasks as RobBERT, as well as on two COVID-19-related tasks, and the model performs very well on most of them.

Usage

RobBERT-2022 is a plug-in replacement for the original RobBERT model: you only need to update your code to use the new model identifier in the HuggingFace Transformers library.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the RobBERT-2022 tokenizer and model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
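Since RobBERT-2022 is RoBERTa-based, a quick way to try the model interactively is a fill-mask pipeline on its masked-language-modelling head. The short example below, with an illustrative Dutch prompt of our own, prints the top predictions for the <mask> token:

from transformers import pipeline

# RobBERT uses the RoBERTa-style "<mask>" token
unmasker = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2022-dutch-base")
print(unmasker("Er staat een <mask> in mijn tuin."))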

Linked publications

Pieter Delobelle, Thomas Winters, Bettina Berendt. RobBERT: a Dutch RoBERTa-based Language Model. Findings of the Association for Computational Linguistics: EMNLP 2020, 2020.
Pieter Delobelle, Thomas Winters, Bettina Berendt. RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use. arXiv, 2022.