RobBERT-2022 is the newest release of the Dutch RobBERT model. Since the original release in January 2020, Dutch language use has evolved. For instance, the COVID-19 pandemic introduced a wide range of new words that suddenly entered daily use. To account for these and other changes in language use, we release a new Dutch BERT model trained on data from 2022.
Thanks to this more recent training data, the model shows improved performance on tasks related to recent events, e.g. COVID-19-related tasks. We also found that on some tasks involving no information more recent than 2019, the original RobBERT model can still outperform this newer one.
I've detailed the development of RobBERT-2022 in two blog posts: part one and part two. The gist is that we added almost 3k new tokens to the vocabulary and then continued pre-training the model from the old weights.
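The core idea of extending a pre-trained model's vocabulary can be sketched with a toy example. Note that this is a minimal, hypothetical illustration with a made-up four-word vocabulary and NumPy in place of the real tokenizer and model, not the exact RobBERT-2022 procedure: new tokens get fresh embedding rows (here initialised to the mean of the existing embeddings, a common heuristic), after which pre-training continues from the old weights.

```python
import numpy as np

# Toy vocabulary and embedding matrix standing in for a pre-trained model.
# (Illustrative only: the real RobBERT vocabulary is ~40k entries, and the
# 2022 update added almost 3k new tokens before continued pre-training.)
vocab = {"<s>": 0, "</s>": 1, "taal": 2, "model": 3}
hidden_size = 8
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), hidden_size))

def add_tokens(vocab, embeddings, new_tokens):
    """Append new tokens and grow the embedding matrix accordingly.

    Each new row is initialised to the mean of the existing embeddings,
    a common heuristic when extending a pre-trained vocabulary.
    """
    init_row = embeddings.mean(axis=0, keepdims=True)
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            embeddings = np.vstack([embeddings, init_row])
    return vocab, embeddings

# Words that became common during the pandemic (hypothetical additions).
vocab, embeddings = add_tokens(vocab, embeddings, ["lockdown", "vaccinatie"])
print(len(vocab), embeddings.shape)  # 6 (6, 8)
```

In practice this corresponds to extending the tokenizer and resizing the model's embedding layer, then resuming masked-language-model pre-training so the new rows pick up meaningful representations.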
We evaluated RobBERT-2022 on the same benchmark tasks as RobBERT, plus two COVID-19-related tasks, and the model performs very well on most of them.
RobBERT-2022 is a drop-in replacement for the original RobBERT model, so you can simply update your code to use the new model identifier with the Hugging Face Transformers library.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
```