With RobBERT and more recently with Tweety-7B-dutch, I have created quite a few tokenizers. I typically choose BPE over WordPiece, but the packages used to create them varied a bit. For our latest Tweety models, we worked with SentencePiece, as it gives decent BPE tokenizers, but the library is only a thin wrapper with not-so-sensible defaults for many cases.
When dealing with scraped data, there is often some noise present. This affects the tokenizer, since it might create merges based on words or characters from other languages. But these merges are not really useful, since they do not lead to words in the language that we care about. In addition, they take up valuable embedding space, since every merge leads to a new token type.
This blog post is not intended to be a deep dive into tokenization; for that, I refer to the thesis of a student of mine, Thomas Bauwens. Instead, I give some tips on how to set up a half-decent BPE tokenizer with SentencePiece. I focus on Dutch in this blog post, since that is a language I understand, but the tips are generally applicable and I also use them for other languages.
I use the allenai/nllb dataset for this blog post, which is a bit noisy.
First attempt: total coverage
To get started, I use the following script that just iterates over a dataset and trains a SentencePiece model. The vocab size is 22k, which is a bit arbitrary, but since I will have a few vocabs for different languages, I want to keep them small.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    sentence_iterator=dataset_iterator,  # iterator over raw sentences
    model_writer=model,                  # file-like object that receives the serialized model
    vocab_size=vocab_size,               # 22k in this case
    character_coverage=1.0,
    model_type='bpe',
    split_digits=True,
    allow_whitespace_only_pieces=True,
    num_threads=24,
    max_sentence_length=300000,
    train_extremely_large_corpus=True,
    byte_fallback=True,
    accept_language=languages,
    unk_piece='<unk>',
    bos_piece='<s>',
    eos_piece='</s>',
    pad_piece='<pad>',
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=3,
    user_defined_symbols="▁▁,▁▁▁,▁▁▁▁,▁▁▁▁▁▁▁▁,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁,--,----,-----,--------,----------------,++,/**,***,****,******,********,**/,##,###,<|im_start|>,<|im_end|>,<|system|>,<|user|>,<|assistant|>,▁—,▁“,“,”,’,—",
)
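The dataset_iterator, model, vocab_size and languages variables are set up outside this snippet. A minimal sketch of that setup, assuming a streamed Hugging Face datasets config for Dutch (the exact config name, field layout, output filename and accept_language value below are just illustrative):

import io
from datasets import load_dataset

# Example setup; the NLLB config name and field names are illustrative
dataset = load_dataset("allenai/nllb", "eng_Latn-nld_Latn", split="train", streaming=True)
dataset_iterator = (row["translation"]["nld_Latn"] for row in dataset)

model = io.BytesIO()   # SentencePiece writes the serialized model into this buffer
vocab_size = 22_000
languages = "nl"       # comma-separated codes for accept_language

# after training, persist the model to disk, e.g.:
# open("nl_bpe.model", "wb").write(model.getvalue())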
The results are not bad and the Dutch merges make up a good chunk of the tokenizer, but there are some issues. The vocab starts with some control tokens (up to token #34) and then introduces all hex values between 0x00 and 0xFF. Not ideal. Is this an ASCII default that got incorrectly converted? Take a look at the "en" merge resulting in token #290. It merges "e" and "n", but where are those tokens? They are at #16790 and #16791, instead of in the beginning... This is apparently caused by the byte_fallback option. Another issue is that there are quite a lot of Chinese characters in the vocab, which is not that useful for Dutch.
"<0x00>": 34,
"<0x01>": 35,
"<0x02>": 36,
"<0x03>": 37,
"<0x04>": 38,
...
"<0xFE>": 288,
"<0xFF>": 289,
"en": 290,
"er": 291,
...
"▁Hebt": 16786,
"▁RO": 16787,
"▁songs": 16788,
"▁": 16789,
"e": 16790,
"n": 16791,
"a": 16792,
...
"国": 17413,
"": 17414,
"🎉": 17415,
"Ù": 17416,
"ľ": 17417,
...
"": 21999
Some evaluation with trans-tokenization shows that 77.7% of the tokens get mapped from an English Llama 3 tokenizer. Not too bad, but 22.3% unmapped tokens—and basically useless embeddings—is still quite a lot.
Second attempt: less coverage and no fallback
We do not really need byte fallback, so let's remove it. It could be handy for emojis, but the most prevalent emojis should end up in the vocab anyway simply by occurring often enough.
spm.SentencePieceTrainer.train(
    ...
    character_coverage=0.9995,
    byte_fallback=False,
)
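Note that without byte fallback, anything that falls outside the vocab now maps to the unknown token instead of being split into byte pieces. A quick way to see this, again assuming an example model filename:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="nl_bpe_nofallback.model")
ids = sp.encode("Dat is 国")
print([sp.id_to_piece(i) for i in ids])
# characters that fell outside the 0.9995 coverage now map to the <unk> piece
# instead of a sequence of <0x..> byte tokens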
No more Chinese characters, that is good. But now we are actually also missing some tokens that I would like to have, like { and }. These are typically used for JSON, and outputting valid JSON is actually one of the scandeval benchmarks.
Third attempt: defining required tokens
We can define some tokens that we want to have in the vocab. This is a bit of a hack, since ideally the tokenizer would learn these merges by itself. But we do need JSON support, so what can you do 🤷‍♂️.
spm.SentencePieceTrainer.train(
    ...
    user_defined_symbols="▁▁,▁▁▁,▁▁▁▁,▁▁▁▁▁▁▁▁,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁,--,----,-----,--------,----------------,++,/**,***,****,******,********,**/,##,###,<|im_start|>,<|im_end|>,<|system|>,<|user|>,<|assistant|>,▁—,▁“,“,”,’,—,{,}\",{\",\"}",
)
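A quick sanity check that the braces now come out as single pieces, again with an example model filename:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="nl_bpe_json.model")
print(sp.encode('{"naam": "Thomas"}', out_type=str))
# user-defined symbols are always segmented as their own pieces,
# so { and } are guaranteed to appear as single tokens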
Ok, perhaps enabling byte fallback is not too bad after all, especially if we reduce the coverage as advised by Kudo himself.
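In code, that combination would look something like this:

spm.SentencePieceTrainer.train(
    ...
    character_coverage=0.9995,
    byte_fallback=True,
)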
In any case, the resulting tokenizer is reasonably decent. For reference: our trans-tokenization can now map 99.8% of the tokens.
European tokenizers
If you just care about some decent tokenizers for EU languages, we are creating and releasing a set combined with Trans-Tokenized LLM heads. You can follow the process here.