For our Trans-tokenization project, we created the Tweety-7B language models for Dutch, Tatar, Italian (8B, from Llama 3), and Armenian. However, we also wanted to train an LLM for all European languages.
Base model
To start training our models, we have to make some choices. We initially used Llama 3 8B and Mistral 7B for the Dutch and Italian models. Since we do not have unlimited compute and smaller models are often more practical to run, we decided to focus on trans-tokenizing gemma-2-2b. The nice thing is that this model also has larger versions with the same tokenizer (unlike Phi 3), so we could train a bigger one for some languages without redoing the conversion. And a 2B model is also easily trainable for someone without a lot of compute. So gemma-2-2b it is.
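As a quick sanity check (not part of the pipeline itself), you can verify that the 2B and 9B Gemma 2 checkpoints really ship the same tokenizer. The model IDs below are the gated Hugging Face Hub repos, so this assumes you have access:

```python
from transformers import AutoTokenizer

# Load the tokenizers of the 2B and 9B Gemma 2 checkpoints
# (both are gated on the Hugging Face Hub).
tok_2b = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tok_9b = AutoTokenizer.from_pretrained("google/gemma-2-9b")

# A shared vocabulary means the trans-tokenized embedding mapping
# could be reused for the bigger model without redoing the conversion.
print(tok_2b.vocab_size)                         # ~256k tokens
print(tok_2b.get_vocab() == tok_9b.get_vocab())  # expected: True
```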
Tokenizer
We also need language-specific tokenizers, but creating decent tokenizers is still a bit of a challenge. Drawing on our experience with tokenizers, we train monolingual SentencePiece tokenizers on the NLLB dataset, so with some noise in the training text. Every tokenizer has around 20k token types, which is a lot lower than most multilingual models (e.g. Gemma has about 250k tokens), but we only care about one language at a time, so 20k is sufficient and keeps the embedding matrix small.
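For reference, training such a tokenizer looks roughly like the sketch below. The input file name and hyperparameters are placeholders, not our exact settings:

```python
import sentencepiece as spm

# Train a ~20k-type monolingual SentencePiece tokenizer.
# "nllb_dutch.txt" is a placeholder: one plain-text sentence per line,
# extracted from the (somewhat noisy) NLLB data for that language.
spm.SentencePieceTrainer.train(
    input="nllb_dutch.txt",
    model_prefix="tweety_nl_20k",
    vocab_size=20_000,
    model_type="unigram",       # SentencePiece default; BPE is the other common choice
    character_coverage=0.9995,  # ignore extremely rare characters from the noisy corpus
)

# The resulting tweety_nl_20k.model can then be wrapped as a Hugging Face tokenizer.
sp = spm.SentencePieceProcessor(model_file="tweety_nl_20k.model")
print(sp.encode("Dit is een voorbeeldzin.", out_type=str))
```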
Training
There are two training strategies after applying our trans-tokenization. We can train the converted model on a new monolingual dataset and get a decent monolingual model out, or we can train the model with all the mapped heads at once. The first one is what we did for Italian and Dutch, and what I am doing here for all EU languages. The second one is more complicated, but it might give us a (good?) multilingual model. That's future work; we focus on training good monolingual models for now. So: one model at a time.
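To make the first strategy concrete, here is a minimal sketch of a continued-pretraining run with the Hugging Face Trainer. The checkpoint name, corpus file, and hyperparameters are placeholders, not the actual training setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Continued pretraining of an already trans-tokenized checkpoint on monolingual text.
# "tweety/gemma-2-2b-nl-transtokenized" and "nl_corpus.txt" are placeholders.
model = AutoModelForCausalLM.from_pretrained("tweety/gemma-2-2b-nl-transtokenized")
tokenizer = AutoTokenizer.from_pretrained("tweety/gemma-2-2b-nl-transtokenized")

dataset = load_dataset("text", data_files={"train": "nl_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tweety-2b-nl",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        bf16=True,
        max_steps=10_000,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard causal language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```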
Evaluation
As we found out with the Dutch and Italian models, training a language model is nice and a bit of an engineering challenge. But then you have a bunch of weights and no clue how good they are. For English, there are many benchmarks collected in the LM evaluation harness, but for other languages we don't have that. There are some forks for one or multiple languages, but they do have some drawbacks. The main ones that I found are coincidentally mostly run by German researchers:
- Occiglot: Fork of the harness with 6 languages (German, French, Italian, Spanish, English and Dutch). The code quality is meh: no documentation, and the readme is just copied from the original harness.
- OpenGPT-X: Also a fork, but it supports 21 out of the EU's 24 languages. However, even though the code for their LM harness fork is available (with the same issues as Occiglot's), it is completely useless since the datasets are private: https://huggingface.co/openGPT-X. Shame.
- GermanBench: Only for German and it seems a bit of an old fork, but it works. At least I can evaluate the German model.
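For completeness, an evaluation run with the current harness's Python API looks roughly like the sketch below. Note that the forks above are older and may only expose a main.py CLI instead, and the model ID and task name here are placeholders, not the tasks those forks actually define:

```python
import lm_eval

# Evaluate a checkpoint with the lm-evaluation-harness Python API.
# "tweety/gemma-2-2b-de" and "hellaswag_de" are placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tweety/gemma-2-2b-de,dtype=bfloat16",
    tasks=["hellaswag_de"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```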