Posted on 2024-05-13

Tweety-7b-dutch

A Dutch generative LLM



Most Dutch generative language models start from an English or multilingual model and finetune it. This works well, but it is not optimal because the tokenizer is largely geared towards English. We present Tweety-7b-dutch, a Dutch generative language model that is trans-tokenized to use Dutch tokens instead of English ones. To highlight the benefits of our method, we show that this model outperforms multilingual and state-of-the-art Dutch generative language models.


We are excited to introduce tweety-7b-dutch, a new Dutch language model with a Dutch tokenizer. This model is created with our trans-tokenization method, which sets it apart from other Dutch models that are finetuned versions of English or multilingual systems. Check out the trans-tokenization blog post for more details on the method.

The difference in tokens between Dutch models and English-first models.

The model

Tokenizer

Tweety-7b-dutch uses a Dutch tokenizer, more specifically the yhavinga/gpt-neo-1.3B-dutch tokenizer, which was trained on a Dutch corpus and has 50k tokens matching Dutch words and subwords. This specialized tokenizer improves both efficiency and quality for several reasons. Primarily, it matches more Dutch subwords, so each token carries more linguistic information. It also needs roughly 33% fewer tokens to process the same amount of text compared to models using more generic tokenizers. This reduction not only speeds up processing but also lowers computational costs significantly. Furthermore, language understanding and generation improve, as the model deals with tokens that are intrinsically more meaningful and representative of Dutch.
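To make the difference concrete, here is a small sketch that compares token counts for the same Dutch sentence under the Dutch tokenizer and Mistral's English-centric one. It assumes the transformers library and access to both tokenizers on the Hugging Face Hub; the example sentence is ours.

```python
from transformers import AutoTokenizer

# Dutch-specific tokenizer used by tweety-7b-dutch
dutch_tok = AutoTokenizer.from_pretrained("yhavinga/gpt-neo-1.3B-dutch")
# English-centric tokenizer of the base model, for comparison
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "De giraf liep rustig door de savanne terwijl de zon onderging."

dutch_ids = dutch_tok(text)["input_ids"]
mistral_ids = mistral_tok(text)["input_ids"]

print(f"Dutch tokenizer:   {len(dutch_ids)} tokens")
print(f"Mistral tokenizer: {len(mistral_ids)} tokens")
print("Dutch tokens:", dutch_tok.convert_ids_to_tokens(dutch_ids))
```

On typical Dutch text, the Dutch tokenizer produces noticeably fewer (and more word-like) tokens than the Mistral tokenizer.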

Pre-training data

Tweety-7b-dutch was pre-trained on yhavinga/mc4_nl_cleaned, a variant of the multilingual C4 dataset specifically cleaned and filtered for Dutch. The dataset consists of web-scraped content of varying quality that has been filtered to remove most low-quality and irrelevant material.
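As a minimal sketch, a slice of this corpus can be streamed with the datasets library. The "tiny" config name is an assumption; the dataset ships several size variants, so adjust accordingly.

```python
from datasets import load_dataset

# Stream a subset of the cleaned Dutch mC4 corpus instead of downloading all of it.
# NOTE: the "tiny" config name is an assumption; pick the size variant you need.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "tiny", split="train", streaming=True)

# Peek at the first few documents.
for i, example in enumerate(dataset):
    print(example["text"][:120], "...")
    if i == 2:
        break
```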

The model was trained on 8.5 billion tokens with a large context window of 8,192 tokens, which means that Tweety-7b-dutch can maintain context over longer passages of text, an essential feature for understanding and generating coherent, contextually appropriate language.

Model Versions

Tweety-7b-dutch has a base version and an instruction-tuned chat version:

  • base: This version is the foundational model, trained purely on the Dutch text corpus without any task-specific tuning.
  • chat: Building upon the base model, this variant has been fine-tuned with a chat template, tailoring it towards conversational AI and similar interactive applications. This tuning is designed to enhance the model's ability to understand and generate responses to instructions or queries; a short usage sketch follows below.
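The chat variant can be prompted through its chat template. This is a minimal sketch assuming the transformers library; the repository id below is a placeholder for the chat checkpoint on the Hugging Face Hub and should be adjusted to the actual one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: placeholder repository id; replace with the actual chat checkpoint on the Hub.
model_id = "Tweety/tweety-7b-dutch-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a prompt with the model's chat template and generate a reply.
messages = [{"role": "user", "content": "Schrijf een kort gedicht over een giraf."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```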

Performance

Perplexity

Tweety-7b-dutch achieves a perplexity of 7.7 on the held-out set of the mC4 corpus, demonstrating strong language-modeling performance. This is lower than multilingual models such as Mistral, although the numbers cannot be compared directly since the tokenizers differ: the per-token perplexity of _gir + af is usually a bit lower than the perplexity of _giraf. To correct for that, we adjust for the encoded length, which gives an equivalent perplexity of 5.75 in Mistral's tokenizer. Not too bad, and a lot lower than the 7.1 of Mistral-7B-v0.1.
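To make the length correction concrete, here is a small sketch. The token counts are illustrative placeholders (chosen so the output roughly matches the numbers above), not the actual counts from our held-out set.

```python
import math

ppl_dutch = 7.7          # per-token perplexity of tweety-7b-dutch under its Dutch tokenizer
n_dutch_tokens = 857     # placeholder: length of the eval text in Dutch tokens
n_mistral_tokens = 1000  # placeholder: length of the same text in Mistral tokens

# The model's total negative log-likelihood of the text does not depend on how
# we later choose to average it per token:
total_nll = n_dutch_tokens * math.log(ppl_dutch)

# Re-averaging that total over Mistral's token count gives a perplexity that can
# be compared with Mistral-7B-v0.1's own per-token perplexity.
ppl_equivalent = math.exp(total_nll / n_mistral_tokens)
print(f"Equivalent perplexity in Mistral's tokenization: {ppl_equivalent:.2f}")  # ~5.75
```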

Zero-shot and few-shot performance

Beyond its zero-shot capabilities, we evaluate our model on the Dutch benchmark SQuAD-NL using the Dutch evaluation harness. We are still working on a more comprehensive evaluation, but the initial results look promising.

Performance of tweety-7b-dutch on SQuAD-NL.
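For reference, here is a hedged sketch of how such an evaluation could be run through the lm-evaluation-harness Python API. Both the model id and the squad_nl task name are assumptions and depend on the Dutch fork of the harness being used.

```python
import lm_eval

# NOTE: model id and task name are assumptions; adjust to the Dutch evaluation
# harness you are running.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Tweety/tweety-7b-dutch,dtype=bfloat16",
    tasks=["squad_nl"],
    num_fewshot=0,
)
print(results["results"])
```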

Related models

Tweety-7b-dutch is part of a series of trans-tokenized LLMs:

  • tweety-7b-dutch
  • tweety-7b-tatar
  • tweety-7b-armenian

These models are created with the same trans-tokenization approach. For true low-resource languages such as Tatar and Armenian, there is simply no other practical way to build a generative model of this scale. We can also swap the native head and the trans-tokenized head to obtain cross-lingual capabilities, such as zero-shot machine translation. For more details, check out the paper.

We are also working on tweety-7b-italian-v24a.

Linked publications

François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester. Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP. First Conference on Language Modeling (COLM), 2024.