Posted on 2024-09-11

European Tweeties

Creating language models for all 24 EU languages



Existing language models are either focused on English or multilingual, with reasonable performance only in the bigger languages. Many languages are either not supported at all or perform a lot worse than when prompted in English. To address this, we are creating 23 new language models covering all EU languages (English is already well served by existing models).


For our Trans-tokenization project, we created the Tweety-7B language models for Dutch, Tatar, Italian (an 8B based on Llama 3) and Armenian. However, we also wanted to train an LLM for all European languages.

Base model

To start training our models, we have to make some choices. We initially used Mistral 7B and Llama 3 8B for the Dutch and Italian models. Since we do not have unlimited compute and smaller models are often more practical to run, we decided to focus on trans-tokenizing gemma-2-2b. The nice thing is that this model also has larger versions with the same tokenizer (unlike Phi 3), so we could train a bigger model for some languages without redoing the conversion. And a 2B model is also easy to train for someone without a lot of compute. So gemma-2-2b it is.
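
To make the "same tokenizer across sizes" point concrete: a token mapping built against gemma-2-2b should transfer directly to gemma-2-9b. A quick sanity check (not from the original post; the model identifiers are the public Hugging Face repos):

```python
from transformers import AutoTokenizer

# Sanity check: the gemma-2 family shares one tokenizer across model sizes,
# so a token mapping built for gemma-2-2b can be reused for gemma-2-9b
# without redoing the trans-tokenization.
tok_2b = AutoTokenizer.from_pretrained("google/gemma-2-2b")
tok_9b = AutoTokenizer.from_pretrained("google/gemma-2-9b")

assert tok_2b.get_vocab() == tok_9b.get_vocab(), "vocabularies differ"
print(f"Shared vocabulary size: {len(tok_2b)}")
```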

Tokenizer

We also need language-specific tokenizers, but creating decent tokenizers is still a bit of a challenge. Drawing on our earlier experience with tokenizers, we train monolingual SentencePiece tokenizers on the NLLB dataset, which does contain some noise. Every tokenizer has around 20k token types, which is a lot lower than most multilingual models (e.g. gemma has a 256k vocabulary), but since each tokenizer only needs to cover one language, 20k is sufficient and it keeps the embedding matrix small.
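
Training such a tokenizer is essentially a one-liner with the sentencepiece library. A rough sketch; the input file name and the exact trainer options are assumptions, not the configuration we actually used:

```python
import sentencepiece as spm

# Train a ~20k-vocabulary monolingual SentencePiece tokenizer.
spm.SentencePieceTrainer.train(
    input="nllb_nl.txt",            # one sentence per line, extracted from NLLB
    model_prefix="tokenizer_nl",    # writes tokenizer_nl.model / tokenizer_nl.vocab
    vocab_size=20_000,              # ~20k token types
    model_type="unigram",           # SentencePiece's default unigram LM
    character_coverage=0.9995,      # tolerate some noise in the corpus
)

# Load and inspect the result.
sp = spm.SentencePieceProcessor(model_file="tokenizer_nl.model")
print(sp.encode("Dit is een voorbeeldzin.", out_type=str))
```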

Training

There are two training strategies after applying our trans-tokenization. We can continue training the converted model on a monolingual dataset and get a decent monolingual model out, or we can train a single model with all the mapped heads, which might give us a (good?) multilingual model. The first option is what we did for Italian and Dutch and what I am doing here for all EU languages. The second is more complicated and remains future work; for now we focus on training good monolingual models, one at a time.
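
Concretely, the first strategy is plain continued pretraining with a causal LM objective. A minimal sketch with the Hugging Face Trainer; the checkpoint path, data file and hyperparameters below are placeholders, not the settings actually used for these models:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder paths: a trans-tokenized gemma-2-2b checkpoint and a monolingual corpus.
checkpoint = "path/to/trans-tokenized-gemma-2-2b-nl"
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

dataset = load_dataset("text", data_files="monolingual_nl.txt", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tweety-nl-2b",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```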

Evaluation

As we found out with the Dutch and Italian models, training a language model is nice and a bit of an engineering challenge, but then you end up with a bunch of weights and no clue how good they are. For English, there are many benchmarks collected in the LM evaluation harness, but for other languages we don't have that. There are some forks covering one or more languages, but they come with drawbacks. The main ones I found are, coincidentally, mostly run by German researchers (a sketch of running such a fork follows the list):

  • Occiglot: a fork of the harness covering six languages (German, French, Italian, Spanish, English and Dutch). The code quality is meh, with no documentation and a README simply copied from the original harness.
  • OpenGPT-X: also a fork, supporting 21 of the EU's 24 languages. However, even though the code for their lm harness is available (with the same issues as Occiglot's), it is completely useless since the datasets are private: https://huggingface.co/openGPT-X. Shame.
  • GermanBench: only covers German and it seems to be a somewhat old fork, but it works. At least I can evaluate the German model.
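
For completeness, this is roughly what evaluating a checkpoint with one of these harness forks looks like via the Python API. The model path and task names are hypothetical; each fork defines its own task identifiers (and older forks use the model type "hf-causal" instead of "hf"), so check the fork's task list first:

```python
from lm_eval import evaluator

# Evaluate a checkpoint on a couple of (hypothetical) German tasks.
results = evaluator.simple_evaluate(
    model="hf",                                   # "hf-causal" in older harness forks
    model_args="pretrained=path/to/tweety-de-2b",
    tasks=["arc_de", "hellaswag_de"],             # task names differ per fork
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```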

Linked publications

François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester. Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP. First Conference on Language Modeling, 2024.