RobBERT-2022 is the newest release of the Dutch RobBERT model. Since the original release in January 2020, a lot has happened and our language has evolved. For instance, the COVID-19 pandemic introduced a wide range of new words that suddenly became part of everyday use. To account for this and other changes in language use, we release a new Dutch BERT model trained on data from 2022.
Thanks to this more recent dataset, this model shows increased performance on several tasks related to recent events, e.g. COVID-19-related tasks. We also found that for some tasks that do not involve information more recent than 2019, the original RobBERT model can still outperform this newer one.
Creating RobBERT-2022
We've detailed the development of RobBERT-2022 in two blog posts: part one and part two. The gist is that we added almost 3k new tokens to the vocabulary and then continued pre-training the model from the original RobBERT weights.
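For a rough idea of what this recipe looks like, the sketch below shows the general vocabulary-extension steps using the HuggingFace Transformers API. This is a minimal, simplified illustration, not our actual training code: the original RobBERT identifier is the one on the HuggingFace Hub, but the list of new tokens shown here is a hypothetical placeholder.

# Minimal sketch of the vocabulary-extension recipe, assuming the
# HuggingFace Transformers API. The new-token list is a hypothetical
# placeholder; the real ~3k tokens were selected from the 2022 corpus.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")

new_tokens = ["lockdown", "mondmasker", "boosterprik"]  # hypothetical examples
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new tokens get (randomly initialised) rows;
# all other weights keep their pre-trained values.
model.resize_token_embeddings(len(tokenizer))

# From here, one would continue masked-language-model pre-training on the 2022 corpus.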
We evaluated RobBERT-2022 on the same benchmark tasks as the original RobBERT, as well as on two COVID-19-related tasks, and the model performs very well on most of them.
Usage
RobBERT-2022 is a drop-in replacement for the original RobBERT model: you only need to update your code to use the new model identifier in the HuggingFace Transformers library.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the RobBERT-2022 tokenizer and model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
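Since RobBERT-2022 is a RoBERTa-based masked language model, you can also try it out directly with the fill-mask pipeline; the Dutch example sentence below is just an illustration of our choosing.

from transformers import pipeline

# Quick sanity check with the fill-mask pipeline; RobBERT uses
# RoBERTa's <mask> token. The example sentence is illustrative.
unmasker = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2022-dutch-base")
print(unmasker("Er staat een <mask> in mijn tuin."))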