Posted on 2024-08-30

Setting up a decent SentencePiece tokenizer

Reasonable monolingual tokenization from noisy data



Tokens are what make language models understand language. Each sentence gets split into tokens, which are then converted to embeddings. So we want good tokens that cover as many of a language's words as possible with our limited vocabulary. Luckily there are many libraries for this, like SentencePiece. However, getting decent results on noisy data requires some non-trivial configuration.


With RobBERT and more recently with Tweety-7B-dutch, I have created quite a few tokenizers. I typically choose BPE over WordPiece, but the packages used to create them have varied a bit. For our latest Tweety models, we worked with SentencePiece, as it gives decent BPE tokenizers, but the library is only a thin wrapper with not-so-sensible defaults for many cases.

When dealing with scraped data, there is often some noise present. This affects the tokenizer, since it might create merges based on words or characters from other languages. But these merges are not really useful, since they do not lead to words in the language that we care about. In addition, they take up valuable embedding space, since every merge leads to a new token type.

This blog post is not intended to be a deep dive into tokenization; for that, I refer to the thesis of a student of mine, Thomas Bauwens. Instead, I give some tips on how to set up a half-decent BPE tokenizer with SentencePiece. I focus on Dutch, since that is a language I understand, but the tips are generally applicable and I also use them for other languages.

For this blog post, I use the allenai/nllb dataset, which is a bit noisy.
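
As a quick sketch of what that looks like in practice: the dataset can be streamed with the datasets library, keeping only the Dutch side. The config name eng_Latn-nld_Latn and the translation field layout are assumptions on my part, so double-check them against the dataset card.

from datasets import load_dataset

# Stream the (noisy) NLLB bitext and keep only the Dutch side.
nllb = load_dataset(
    "allenai/nllb", "eng_Latn-nld_Latn", split="train", streaming=True
)
dataset_iterator = (example["translation"]["nld_Latn"] for example in nllb)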

First attempt: total coverage

To get started, I use the following script that just iterates over a dataset and trains a SentencePiece model. The vocab size is 22k, which is a bit arbitrary, but since I will have a few vocabs for different languages, I want to keep them small.

import sentencepiece as spm

spm.SentencePieceTrainer.train(
        # dataset_iterator yields raw sentences; model is a writable buffer
        # that receives the serialized tokenizer (see the sketches before
        # and after this block).
        sentence_iterator=dataset_iterator,
        model_writer=model,
        vocab_size=vocab_size,
        character_coverage=1.0,
        model_type='bpe',
        split_digits=True,
        allow_whitespace_only_pieces=True,
        num_threads=24,
        max_sentence_length=300000,
        train_extremely_large_corpus=True,
        byte_fallback=True,
        accept_language=languages,
        unk_piece='<unk>',
        bos_piece='<s>',
        eos_piece='</s>',
        pad_piece='<pad>',
        unk_id=0,
        bos_id=1,
        eos_id=2,
        pad_id=3,
        user_defined_symbols="▁▁,▁▁▁,▁▁▁▁,▁▁▁▁▁▁▁▁,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁,--,----,-----,--------,----------------,++,/**,***,****,******,********,**/,##,###,<|im_start|>,<|im_end|>,<|system|>,<|user|>,<|assistant|>,▁—,▁“,“,”,’,—",
    )
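
For completeness, here is a minimal sketch of the glue around that call: model is just a writable buffer that receives the serialized tokenizer, which can then be written to disk and loaded back for a quick sanity check. The variable values and the nl.model file name are placeholders I picked, not something prescribed by SentencePiece.

import io

import sentencepiece as spm

vocab_size = 22_000     # the ~22k vocab mentioned above
languages = ["nl"]      # language code(s) passed to accept_language
model = io.BytesIO()    # SentencePiece writes the serialized model here

# ... the spm.SentencePieceTrainer.train(...) call from above goes here ...

# Persist the trained model and load it back for a quick sanity check.
with open("nl.model", "wb") as f:
    f.write(model.getvalue())

sp = spm.SentencePieceProcessor(model_file="nl.model")
print(sp.encode("Dit is een zin.", out_type=str))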

The results are not bad and the Dutch merges make up a good chunk of the tokenizer, but there are some issues. The vocab starts with the control and user-defined tokens (the first 34 entries) and then introduces all 256 byte values from <0x00> to <0xFF>. Not ideal. Is this some ASCII-style default that got converted incorrectly? Take a look at the "en" merge that results in token #290: it merges "e" and "n", but where are those single-character tokens? They sit at #16790 and #16791, instead of at the beginning... This is apparently caused by the byte_fallback option.

Another issue is that there are quite a lot of Chinese characters in the vocab, which is not that useful for Dutch.

"<0x00>": 34,
"<0x01>": 35,
"<0x02>": 36,
"<0x03>": 37,
"<0x04>": 38,
...
"<0xFE>": 288,
"<0xFF>": 289,
"en": 290,
"er": 291,
...
"▁Hebt": 16786,
"▁RO": 16787,
"▁songs": 16788,
"▁": 16789,
"e": 16790,
"n": 16791,
"a": 16792,
...
"国": 17413,
"": 17414,
"🎉": 17415,
"Ù": 17416,
"ľ": 17417,
...
"": 21999

Some evaluation with trans-tokenization shows that 77.7% of the tokens get mapped from an English Llama 3 tokenizer. Not too bad, but 22.3% unmapped tokens—and basically useless embeddings—is still quite a lot.

Second attempt: less coverage and no fallback

We do not really need a byte fallback, so let's remove it. It could be handy for emojis, but the most prevalent emojis should end up in the vocab anyway, simply by occurring often enough.

spm.SentencePieceTrainer.train(
        ...
        character_coverage=0.9995,
        byte_fallback=False,
    )

No more Chinese characters, that is good. But now we are also missing some tokens that I would like to have, like { and }. These are needed for JSON, and outputting valid JSON is actually one of the ScandEval benchmarks.
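
An easy way to spot such gaps is to check whether the pieces we care about map to a real id or fall through to <unk>; a small sketch, again assuming the model file is nl.model:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="nl.model")

# Without byte fallback, anything outside the learned vocab collapses to <unk>.
for piece in ["{", "}", "é", "🎉"]:
    piece_id = sp.piece_to_id(piece)
    if piece_id == sp.unk_id():
        print(f"{piece!r} is missing (maps to <unk>)")
    else:
        print(f"{piece!r} has id {piece_id}")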

Third attempt: defining required tokens

We can define some tokens that we want to have in the vocab. This is a bit of a hack, since ideally the tokenizer would learn these by itself. But we do need JSON support, so what can you do 🤷‍♂️.

spm.SentencePieceTrainer.train(
        ...
        user_defined_symbols="▁▁,▁▁▁,▁▁▁▁,▁▁▁▁▁▁▁▁,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁,--,----,-----,--------,----------------,++,/**,***,****,******,********,**/,##,###,<|im_start|>,<|im_end|>,<|system|>,<|user|>,<|assistant|>,▁—,▁“,“,”,’,—,{,}\",{\",\"}",
    )
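
After retraining with these symbols, a quick check that a JSON snippet now tokenizes without falling back to <unk>; again a sketch with a placeholder model path and example string:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="nl.model")

# The braces are now user-defined symbols, so they get ids of their own.
print(sp.encode('{"naam": "waarde"}', out_type=str))
print("{ piece id:", sp.piece_to_id("{"), "- unk id:", sp.unk_id())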

OK, perhaps enabling byte fallback is not too bad after all, especially if we reduce the character coverage, as advised by Kudo himself.

In any case, the resulting tokenizer is reasonably decent; for reference, our trans-tokenization can now map 99.8% of the tokens.
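
If you then want to use the result with the Hugging Face ecosystem, one option is to wrap the SentencePiece model in a SentencePiece-backed slow tokenizer class; a rough sketch, with LlamaTokenizer as an arbitrary choice and placeholder paths:

from transformers import LlamaTokenizer

# Wrap the trained SentencePiece model; the special tokens match the
# pieces we configured during training.
tokenizer = LlamaTokenizer(
    vocab_file="nl.model",
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
)
tokenizer.save_pretrained("nl-tokenizer")
print(tokenizer.tokenize("Dit is een zin."))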

European tokenizers

If you just care about some decent tokenizers for EU languages, we are creating and releasing a set combined with Trans-Tokenized LLM heads. You can follow the process here.
