RobBERT is the state-of-the-art Dutch BERT model. It is a large pre-trained general Dutch language model that can be fine-tuned on a given dataset to perform any text classification, regression or token-tagging task. As such, it has been successfully used by many researchers and practitioners for achieving state-of-the-art performance for a wide range of Dutch natural language processing tasks.

Meet RobBERT, the state-of-the-art Dutch BERT model that effortlessly tackles a diverse range of language tasks. Trained on hundreds of millions of Dutch sentences, RobBERT excels in text classification, regression, and token-tagging with unparalleled accuracy. Whether it's predicting sentiment in book reviews with 94% precision or filling in missing words like "die" or "dat" with a remarkable 98% success rate, RobBERT is your go-to solution for nuanced language processing.

RobBERT has become the trusted choice for researchers and practitioners in Dutch natural language processing for achieving state-of-the-art performance. RobBERT's performance has been demonstrated on:

Emotion detection
Sentiment analysis (book reviews, news articles*)
Coreference resolution
Named entity recognition (CoNLL, job titles*, SoNaR)
Part-of-speech tagging (Small UD Lassy, CGN)
Zero-shot word prediction
Humor detection
Cyberbullying detection
Correcting dt-spelling mistakes*
Natural language inference*
Review classification*

To load the RobBERT model with Hugging Face transformers, use the model name pdelobelle/robbert-v2-dutch-base.

RobBERT-2023: RobBERT-2023 is the newest release of our RobBERT model. It has a new tokenizer and we release both large and base-sized models.

RobBERTje: Smaller, but still capable variants of RobBERT made with knowledge distillation from the original RobBERT model. We release 4 variants of different sizes for easier use and faster inference.

RobBERT: RobBERT-v2 is our original pre-trained general Dutch language model that can be fine-tuned on a given dataset to perform any text classification, regression or token-tagging task.

| | RobBERT-2023 | RobBERTje | RobBERT |
|---|---|---|---|
| Size (number of parameters) | 117M and 355M | 40M-74M | 110M |
| DUMB benchmark (improvement over BERTje) | +18.6 (with 355M params) | / (not evaluated) | +1.6 |
| Sentiment analysis (Dutch book reviews) | 94.7 | 92.9 | 93.2 |
| Text understanding (SICK-NL) | 89.3 | 83.4 | 84.2 |
| Releases | robbert-2023-dutch-large, robbert-2023-dutch-base | robbertje-1-gb-shuffled, robbertje-1-gb-merged, robbertje-1-gb-non-shuffled, robbertje-1-gb-bort | robbert-v2-dutch-base |

Get started with our models

You can use all our RobBERT models with the Hugging Face transformers library.

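from transformers import RobertaTokenizer, RobertaForSequenceClassification
# Load the tokenizer and the pre-trained weights from the Hugging Face hub.
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
# The sequence classification head is newly initialized, so fine-tune it before use.
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")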

Intro in RobBERT, BERT and RoBERTa

The advent of neural networks in natural language processing (NLP) has significantly improved state-of-the-art results within the field. While recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) initially dominated the field, recent models started incorporating attention mechanisms and then later dropped the recurrent part and just kept the attention mechanisms in so-called transformer models. This latter type of model caused a new revolution in NLP and led to popular language models like GPT-2 (ELMo, another popular model from that period, still relied on LSTMs). BERT improved over previous transformer models and recurrent networks by allowing the system to learn from input text in a bidirectional way, rather than only from left-to-right or the other way around. This model was later re-implemented, critically evaluated and improved in the RoBERTa model.

These large-scale transformer models provide the advantage of being able to solve NLP tasks by having a common, expensive pre-training phase, followed by a smaller fine-tuning phase. The pre-training happens in an unsupervised way by providing large corpora of text in the desired language. The second phase only needs a relatively small annotated data set for fine-tuning to outperform previous popular approaches in one of a large number of possible language tasks.

While language models are usually trained on English data, some multilingual models also exist. These are usually trained on a large quantity of text in different languages. For example, Multilingual-BERT is trained on a collection of corpora in 104 different languages and generalizes language components well across languages. However, models trained on data from one specific language usually improve the performance over multilingual models for this particular language. Training a RoBERTa model on a Dutch dataset thus has a lot of potential for increasing performance for many downstream Dutch NLP tasks.

Illustration of transformer models and how the encoder and decoder stacks are part of it.

In NLP, encoder-decoder models have been used for some time. These models, often called sequence-to-sequence or seq2seq, are good at various sequence-based tasks: translation, token labeling, named entity recognition (NER), etc. Historically, these seq2seq models were usually LSTMs or other recurrent networks. A major improvement in these networks was the attention mechanism, which allowed the encoder to pass more than one feature vector to the decoder. (For those coming from computer vision, this looks a bit like the skip connections in U-Net.)

The by now famous transformer model was based solely on this attention mechanism. It features 2 stacks: (i) an encoder stack that uses multiple layers of self-attention and (ii) a decoder stack with attention layers that connect back to the encoder outputs.

Encoders

So this attention-based encoder generates a transformed representation of the input sequence at once.

Illustration of the encoder in RobBERT used for language modeling, by predicting the most likely word.

We could also interpret this probabilistically: we have a language model that assigns a probability to every candidate word for the masked position.

$$P(\text{``giraf"} \mid \text{``ik zie een <mask> in mijn tuin."})<0.0001$$

Or a more probable:

$$P(\text{``boom"} \mid \text{``ik zie een <mask> in mijn tuin."})=0.1498$$

In fact, we can even query the most likely results. For this sentence, RobBERT gives us:

[('Ik zie een lamp in mijn tuin.', 0.39584335684776306, ' lamp'),
 ('Ik zie een boom in mijn tuin.', 0.1497979462146759, ' boom'),
 ('Ik zie een camera in mijn tuin.', 0.089895099401474, ' camera'),
 ('Ik zie een ster in mijn tuin.', 0.046020057052373886, ' ster'),
 ('Ik zie een stip in mijn tuin.', 0.009481011889874935, ' stip'),
 ('Ik zie een man in mijn tuin.', 0.009198895655572414, ' man'),
 ('Ik zie een slang in mijn tuin.', 0.009129301644861698, ' slang'),
 ('Ik zie een stem in mijn tuin.', 0.007939961738884449, ' stem'),
 ('Ik zie een bos in mijn tuin.', 0.007785357069224119, ' bos'),
 ('Ik zie een storm in mijn tuin.', 0.0077188946306705475, ' storm')]

Ok, I'll be honest. I thought a tree (een boom) would be the most likely, I didn't even think of a lamp (een lamp). But I guess it makes sense anyway. And giraffes are not even in the top 10k suggestions, so that's disappointing.
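A ranking like the one above can be obtained with the fill-mask pipeline of the transformers library. The sketch below is one possible way to do it, not necessarily the code that produced the list. The helper names `format_predictions` and `query_robbert` are made up for illustration, and the model is only downloaded when `query_robbert` is actually called.

```python
def format_predictions(raw_predictions):
    """Convert fill-mask output dicts into (sentence, score, token) tuples."""
    return [(p["sequence"], p["score"], p["token_str"]) for p in raw_predictions]


def query_robbert(sentence, top_k=10):
    """Ask RobBERT for the top_k most likely fillers of the <mask> token."""
    # Imported lazily so the formatting helper works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")
    return format_predictions(fill_mask(sentence, top_k=top_k))


# Example call (downloads the model on first use):
# query_robbert("Ik zie een <mask> in mijn tuin.")
```

The pipeline returns its candidates sorted by score, so the first tuple is the most likely completion.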

Decoders

So that’s the encoder side. For some language tasks, it is enough (NER, POS tagging, etc.). But for others, like translation, we need the decoder as well. Translation could thus be formulated as a task $P(A) P(B\mid A)$ with a decoder that depends on the outputs of the encoder or language model. Practically, this looks a bit like this:

$$P(\text{``Ik zie een giraf in mijn tuin"})\newline \cdot P(\text{``I see a giraffe in my garden"}\mid \text{``Ik zie een giraf in mijn tuin"})$$

To actually get to this outcome, the attention mechanism implicitly uses marginal probabilities over all tokens in the encoder, which are then also used by the decoder. It’s also possible to use only the decoder, which is what GPT-2 does: it can generate an output sequence based on only the input, without an encoder. But of course, GPT-2 doesn’t translate sentences, as it lacks that encoder part.

$$P(\text{``I see a giraffe"} \mid \text{``I see a"})=0.01$$

$$P(\text{``I see a giraffe in"} \mid \text{``I see a giraffe"})=0.6$$
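To make the step-by-step generation concrete: a decoder extends the sequence one token at a time, so the probability of a longer continuation is the product of the per-step probabilities. The small sketch below just multiplies the two illustrative values from the formulas above; the function name `sequence_probability` is a hypothetical helper.

```python
from math import prod


def sequence_probability(step_probabilities):
    """Chain rule: multiply the conditional probability of every generated token."""
    return prod(step_probabilities)


# P("giraffe" | "I see a") and P("in" | "I see a giraffe") from the text.
steps = [0.01, 0.6]
continuation = sequence_probability(steps)
print(continuation)  # probability of generating "giraffe in" after "I see a"
```

With these numbers, generating "giraffe in" after "I see a" has a probability of roughly 0.006.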

But since RobBERT is only an encoder stack, we won’t dive deeper into this. From now on, we will only describe an encoder we happen to have lying around (hint: it’s RobBERT).

Great, but how do you input a sentence?

Word embeddings like word2vec were quite popular in seq2seq models, especially since they happened to encode some semantic and grammatical meaning, so they were a quick way to boost the utility. But these models had two drawbacks: (i) if a word is not in the vocabulary, it has no vector (word2vec trained on Google News had 3 million words, so usually it was fine) and (ii) word embeddings give the same vector regardless of context. The go-to example is “stick”, which is both a verb and a noun.

This was addressed by ELMo, which could generate contextualized embeddings. But BERT and RobBERT have a similar trick up their sleeves. If we look back at the probabilistic interpretation, we see that the input of the language model is not just a bag-of-words or TF-IDF input, but the actual text with one word masked. The input is namely the whole sentence, or even multiple sentences.

To deal with all these words, we need something to convert words into an embedding. All transformer-based models, including RobBERT, use a tokenizer. This tokenizer splits the input into words and, if a word is not in its vocabulary, splits it further into subwords. BERT uses WordPiece and RobBERT uses a byte-level BPE tokenizer. These tokenizers have the benefit that all input sentences can be represented, even if some words are missing from the vocabulary. Yay, no more out-of-vocabulary (OOV) issues!
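The idea of falling back to subwords can be illustrated with a toy splitter. This is a deliberately simplified greedy longest-match sketch, closer in spirit to WordPiece than to the byte-level BPE that RobBERT actually uses; the vocabulary and the function name `split_into_subwords` are invented for the example.

```python
def split_into_subwords(word, vocab):
    """Greedily split a word into the longest pieces found in vocab.

    Characters that match nothing are emitted one by one, so every
    input can be represented (no out-of-vocabulary failure).
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or a single character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


toy_vocab = {"tuin", "huis", "je", "gir", "af"}  # hypothetical vocabulary
print(split_into_subwords("tuinhuisje", toy_vocab))
print(split_into_subwords("giraf", toy_vocab))
```

A real tokenizer learns its vocabulary from the training corpus and works on bytes or characters with merge rules, but the effect is the same: unknown words are built from smaller known pieces.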

For all these tokens, we take a vector (similar to word embeddings) from our embeddings matrix. Hold up, wasn’t that problematic for word2vec? Yes, but transformer-based models do something else before the attention heads: they add a positional encoding to the embeddings. In the original transformer this is a fixed pattern of sines and cosines; BERT-style models such as RobBERT learn their position embeddings instead. These altered embeddings then get fed into the first layer of attention heads.
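As an illustration of how position information can be mixed into the token vectors, here is a minimal sketch of the fixed sinusoidal encoding from the original transformer, which is added element-wise to the embeddings. It follows the interleaved layout (sine on even dimensions, cosine on odd ones) and is only meant as an example; BERT-style models learn their position embeddings rather than computing them this way. The function names are hypothetical.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(embeddings):
    """Add the positional encoding to each token embedding, element-wise."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, positional_encoding(pos, d_model))]
        for pos, vector in enumerate(embeddings)
    ]


# Two tokens with all-zero embeddings, so the output is just the encoding itself.
print(add_positions([[0.0] * 4, [0.0] * 4]))
```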

Self-attention mechanism

So we have multiple layers (12 in the case of RobBERT) that each have multiple attention heads (also 12). Each head calculates scaled dot-product attention based on Query (Q), Key (K) and Value (V) vectors, where these vectors are calculated from three respective weight matrices that are learned during training.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

After calculating the self-attention for each head, the outputs of all heads are concatenated and then fed into a linear layer.
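The attention formula above can be written out for a single head in a few lines of plain Python. This is a readable sketch rather than an efficient implementation (real models use batched matrix operations), and it takes Q, K and V as given instead of computing them from learned weight matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# A query that matches both keys equally attends to both values evenly.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

In a multi-head layer this computation runs once per head on lower-dimensional projections of the same input.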

If you want to learn more about the attention mechanism specifically, there are some good resources, like this post from Towards Data Science and Jay Alammar's The Illustrated Transformer!

Pre-training and finetuning

As it happens, language models are quite expensive to train. We used a high-performance computing cluster with ~80 Nvidia P100s for several days. This is because we train it on a large dataset (39 GB of text) with the word-masking objective, which means we randomly replace some words with a <mask> token (or with another word or the same word, but those are less likely). After a few epochs, we have something that resembles the probabilistic language model we described earlier.
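The word-masking objective can be sketched as follows. The proportions used here (select 15% of the tokens, then 80% `<mask>`, 10% random token, 10% unchanged) are the ones from the original BERT recipe and are an assumption for this example; the text above does not state the exact values used for RobBERT. The function name `mask_tokens` is hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15, mask_prob=0.8, random_prob=0.1):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where the model must predict it,
    and None where the position is not part of the training objective.
    """
    inputs, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the token and do not train on this position.
            inputs.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < mask_prob:
            inputs.append("<mask>")
        elif roll < mask_prob + random_prob:
            inputs.append(rng.choice(vocab))  # replace with a random token
        else:
            inputs.append(token)  # keep the original token
    return inputs, labels


sentence = ["Ik", "zie", "een", "boom", "in", "mijn", "tuin"]
print(mask_tokens(sentence, vocab=sentence, rng=random.Random(0)))
```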

But this language model can do more than just filling in some <mask> tokens! This is one of the so-called heads, the one we use to pre-train our language model on. After training, we can easily take it off (perhaps calling them heads was a bit insensitive?) and replace it with another one. This step is then called finetuning. All the pre-trained weights of the base model are kept as the starting point and we add a newly initialized head that we train on the data that we want. And since most weights come from the trained base model, we only need a fraction of the data. So it will go a lot faster as well!

Custom heads: sentence and token classification

So to train our model, we used a language modeling head. But as it turns out, we could also add other heads that do other things. These heads can be trained on a lot less data and are faster to train, so it's really easy to train your own on a custom task.

Token-level classification (left) and sentence-level classification (right) for an example input.

Of course, not all tasks are the same. Roughly, there are two kinds of tasks: (i) sentence-level prediction and (ii) token-level prediction. As the names imply, the difference is that one predicts something on a sentence or document level versus making a prediction for each token.
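The difference between the two kinds of heads can be shown with a toy example. Below, a single linear layer is applied either to the first token's vector (sentence-level) or to every token's vector (token-level). This is a simplified sketch with made-up numbers: the hidden states, weights and function names are hypothetical, and real classification heads usually contain extra layers and dropout.

```python
def linear(vector, weights, bias):
    """Apply a linear layer: one output score per row of weights."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def argmax(values):
    """Index of the highest score."""
    return max(range(len(values)), key=values.__getitem__)


def classify_sentence(hidden_states, weights, bias):
    """Sentence-level head: one label from the first token's representation."""
    return argmax(linear(hidden_states[0], weights, bias))


def classify_tokens(hidden_states, weights, bias):
    """Token-level head: one label per token representation."""
    return [argmax(linear(h, weights, bias)) for h in hidden_states]


# Three tokens with 2-dimensional hidden states and a 2-class head.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
head_weights = [[1.0, 0.0], [0.0, 1.0]]
head_bias = [0.0, 0.0]
print(classify_sentence(hidden, head_weights, head_bias))
print(classify_tokens(hidden, head_weights, head_bias))
```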

If you’re interested in the custom heads we have trained and their performance, see Downstream tasks.

Wrapping it up and why monolingual models matter

So we discussed language models—and their probabilistic interpretation—and how they are related to transformer models. With a lot of self-supervised training on large corpora, these models are relatively good for these tasks. Multilingual models also perform very well, but they do mix a lot of languages (over 100 for Google's mBERT).

As a language model, a multilingual model will then have to deal with very different linguistic properties. For the tokenizer, this means either a lot more tokens, or fewer tokens that represent actual words in one language. Despite these drawbacks, multilingual models might leverage some features from related languages, which could increase performance.

This is also what we observed: multilingual models do perform well on a variety of tasks, especially if sufficient training data is available. If this is not the case, monolingual models like RobBERT have a slight edge.

Downstream tasks

Anaphora resolution with die and dat

We evaluated RobBERT's performance on a task that is specific to Dutch, namely disambiguating "die" and "dat" (= "that" in English). In Dutch, depending on the sentence, both terms can be either demonstrative or relative pronouns; in addition, they can also be used as a subordinating conjunction, i.e. to introduce a clause. The use of either of these words depends on the gender of the word it refers to. Distinguishing these words is a task introduced by Allein et al. (2020), who presented multiple models trained on the Europarl and SoNaR corpora. Their results ranged from an accuracy of 75.03% on Europarl to 84.56% on SoNaR.

For this task, we use the Dutch version of the Europarl corpus, which we split into 1.3M utterances for training, 319k for validation, and 399k for testing. We then process every sentence by checking if it contains "die" or "dat", and if so, add a training example for every occurrence of this word in the sentence, where a single occurrence is masked. For the test set, for example, this resulted in about 289k masked sentences. We then test two different approaches for solving this task on this dataset. The first approach is making the BERT models use their MLM task and guess which word should be filled in this spot, and check if it has more confidence in either "die" or "dat" (by checking the first 2,048 guesses at most, as this seemed sufficiently large). This allows us to compare the zero-shot BERT models, i.e. without any fine-tuning after pre-training. The second approach uses the same data, but creates two sentences by filling in the mask with both "die" and "dat", appending both with the <sep> token and making the model predict which of the two sentences is correct.
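The example generation and the zero-shot decision rule described above could look roughly like this. It is a sketch under assumptions, not the original preprocessing code: the function names are hypothetical, the regular expression is one simple way to find the two words, and `predictions` stands for a list of `(token, score)` pairs from a masked language model, sorted from most to least likely.

```python
import re


def make_masked_examples(sentence, mask_token="<mask>"):
    """Create one (masked sentence, original word) pair per occurrence of die/dat."""
    examples = []
    for match in re.finditer(r"\b(die|dat)\b", sentence, flags=re.IGNORECASE):
        masked = sentence[: match.start()] + mask_token + sentence[match.end():]
        examples.append((masked, match.group(0)))
    return examples


def choose_die_or_dat(predictions, max_guesses=2048):
    """Return whichever of die/dat ranks first among the top guesses, else None."""
    for token, _score in predictions[:max_guesses]:
        word = token.strip().lower()
        if word in ("die", "dat"):
            return word
    return None


examples = make_masked_examples("Dat is mijn fout, maar dat fout kan ik niet meer aanpassen.")
print(examples)
print(choose_die_or_dat([(" lamp", 0.4), (" die", 0.2), (" dat", 0.1)]))
```

Because the predictions are ranked by confidence, the first of the two words that shows up in the list is the one the model prefers.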

Demo on disambiguating 'die' and 'dat'

With our language model, you can easily disambiguate between 'die' and 'dat' in Dutch. Just enter a sentence in Dutch, and our model will automatically identify which word should be used.

We use a distilled model for this demo, since it is cheaper to deploy and still has pretty good performance. The model is not finetuned, but a zero-shot model, meaning we never had to train it explicitly on this task. The model also takes casing into account.

Or try an example: Dat is mijn fout, maar dat fout kan ik niet meer aanpassen.

High-level sentiment analysis

We compare RobBERT's performance with other BERT models and state-of-the-art systems in sentiment analysis to show how it performs on classification tasks. We replicated the high-level sentiment analysis task used to evaluate BERTje to be able to compare our methods. This task uses a dataset called Dutch Book Reviews Dataset (DBRD), in which book reviews scraped from hebban.nl are labeled as positive or negative. Although the dataset contains 118,516 reviews, only 22,252 of these reviews are actually labeled as positive or negative.

We trained our model for 2000 iterations with a batch size of 128 and a warm-up of 500 iterations, reaching a learning rate of 10⁻⁵. We found that our model performed better when trained on the last part of the book reviews than on the first part. This is likely due to this part containing concluding remarks summarizing the overall sentiment.
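Two details of this setup are easy to express in code: the learning-rate warm-up and keeping the last part of a review. The sketch below assumes a linear warm-up to 10⁻⁵ over the first 500 iterations and a constant rate afterwards; the text does not say what happens after the warm-up, so that part is an assumption, as are the function names.

```python
def learning_rate(step, peak=1e-5, warmup_steps=500):
    """Linear warm-up to the peak rate, then constant (the constant part is assumed)."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak


def keep_last_tokens(tokens, max_length):
    """Truncate a review from the front so that its final tokens are kept."""
    return tokens[max(0, len(tokens) - max_length):]


print([learning_rate(step) for step in (0, 250, 500, 2000)])
print(keep_last_tokens(["Mooi", "verhaal", ",", "echt", "een", "aanrader", "!"], 3))
```

Keeping the tail of the review matches the observation that the concluding remarks tend to carry the overall sentiment.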

Finally, a note on the name and the logo

We named our model RobBERT, or to be more precise: our model named itself RobBERT. When we used word masking in a sentence to introduce itself, it picked RobBERT as the most likely name. In a serendipitous way, this also highlighted the link to RoBERTa, so that name was perfect!

The word rob also means seal in Dutch, hence our logo is a seal being dressed up as Bert. Special thanks to Thomas Winters for the logo!

Paper: RobBERT: a Dutch RoBERTa-based Language Model
