RobBERT-2022 is the newest release of the Dutch RobBERT model. Since the original release in January 2020, a lot has happened and our language has evolved. For instance, the COVID-19 pandemic introduced a wide range of new words that suddenly became part of everyday use. To account for this and other changes in language use, we release a new Dutch BERT model trained on data from 2022.
Thanks to this more recent dataset, this model shows increased performance on several tasks related to recent events, e.g. COVID-19-related tasks. We also found that for some tasks that do not involve information more recent than 2019, the original RobBERT model can still outperform this newer one.
Creating RobBERT-2022
We've detailed the development of RobBERT-2022 in two blog posts: part one and part two. The gist is that we added almost 3k new tokens to the vocabulary and then continued pre-training the model from the original RobBERT weights.
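For a rough idea of what this recipe looks like, the sketch below shows the general vocabulary-extension steps using the HuggingFace Transformers API. This is a minimal, simplified illustration, not our actual training code: the original RobBERT identifier is the one on the HuggingFace Hub, but the list of new tokens shown here is a hypothetical placeholder.

# Minimal sketch of the vocabulary-extension recipe, assuming the
# HuggingFace Transformers API. The new-token list is a hypothetical
# placeholder; the real ~3k tokens were selected from the 2022 corpus.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")

new_tokens = ["lockdown", "mondmasker", "boosterprik"]  # hypothetical examples
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new tokens get (randomly initialised) rows;
# all other weights keep their pre-trained values.
model.resize_token_embeddings(len(tokenizer))

# From here, one would continue masked-language-model pre-training on the 2022 corpus.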
We evaluated RobBERT-2022 on the same benchmark tasks as the original RobBERT, as well as on two COVID-19-related tasks, and the model performs very well on most of them.
Usage
RobBERT-2022 is a drop-in replacement for the original RobBERT model: you only need to update your code to use the new model identifier in the HuggingFace Transformers library.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the RobBERT-2022 tokenizer and model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
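Since RobBERT-2022 is a RoBERTa-based masked language model, you can also try it out directly with the fill-mask pipeline; the Dutch example sentence below is just an illustration of our choosing.

from transformers import pipeline

# Quick sanity check with the fill-mask pipeline; RobBERT uses
# RoBERTa's <mask> token. The example sentence is illustrative.
unmasker = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2022-dutch-base")
print(unmasker("Er staat een <mask> in mijn tuin."))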