Posted on 2024-07-10

Trans-Tokenization

Language Adaptation of LLMs for Low-Resource NLP



We present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to adapt a high-resource monolingual LLM to a new target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages.


Accepted at the First Conference on Language Modeling (COLM 2024)

Language models have become fundamental tools, enhancing everything from simple translation services to complex decision-making processes. While models like GPT and BERT have set impressive benchmarks, their primary focus has often been on English or other high-resource languages. This emphasis overlooks a vast spectrum of global languages, each with unique linguistic features and cultural significance. Developing language models for a wider range of languages democratizes access to technology, allowing non-English speakers to benefit from advancements in AI. Our Dutch masked language models, RobBERT and RobBERT-2023, have shown that there is a demand for non-English models, and their results illustrate that language-specific models can achieve high performance while addressing unique linguistic traits.

Moreover, linguistic diversity in AI helps preserve cultural heritage and promotes equitable access to information and technology. By designing models that are capable of understanding and generating text in low-resource languages, we not only expand the reach of AI applications but also ensure that no language community is left behind. This is especially vital for languages that are underrepresented in technology, where the resources are limited, and existing language models often fail.

To address this gap, we introduce a novel model translation approach, Trans-Tokenization, which leverages the strengths of high-resource language models and adapts them to new linguistic landscapes, effectively broadening the scope and utility of language models. We have been working on this for a few years now, first with the release of RobBERT-2023, converted from a RoBERTa model. Now we have also created a Dutch LLM, tweety-7b-dutch, as well as Tatar, Armenian, and Italian models using our new approach, and we are excited to finally release the paper as well!

The Challenge of Multilingual Models

One major challenge in developing multilingual models is the lack of high-quality, annotated data for many languages. To illustrate the scale of data needed for pretraining LLMs, while one book may contain 40-50k tokens, a single LLM training set might require up to 6 trillion tokens, equivalent to about 2.5 million bookshelves. This vast requirement makes it clear why languages with less digital presence struggle to compete in quality and performance with those like English, which dominate the digital landscape.
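
As a quick sanity check on these numbers, here is a back-of-the-envelope calculation. The tokens-per-book midpoint and the books-per-shelf figure are our own assumptions, not numbers from the paper.

```python
# Back-of-the-envelope check of the scale described above. The number of books
# per shelf is an assumption on our part, not a figure from the post.
tokens_per_book = 45_000                 # midpoint of the 40-50k range
training_tokens = 6_000_000_000_000      # 6 trillion tokens
books_per_shelf = 50                     # assumed

books = training_tokens / tokens_per_book    # ~133 million books
shelves = books / books_per_shelf            # ~2.7 million bookshelves
print(f"{books:,.0f} books, or roughly {shelves:,.0f} bookshelves")
```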

The second challenge stems from the linguistic diversity across languages. There is a trade-off between using a language-specific tokenizer and adapting an English tokenizer. Training from scratch with a language-specific (e.g. Dutch) tokenizer would ideally preserve the linguistic nuances of that language, but at a prohibitive cost. On the other hand, using an English tokenizer, even with methods like fine-tuning or LoRA adaptations, risks generating "Franken-Dutch", where the language model outputs a Dutch-English hybrid that compromises the authenticity and usability of the generated text. That is not ideal, especially if you need guarantees that the model only outputs a certain language.

The per-token representations suffer especially. Take a look at how the sentence 'No, I am not a giraffe! That is an absurd thought.' gets tokenized when we use an English-focused model like Mistral-7B.

Tokenization of the Dutch sentence 'No, I am not a giraffe! That is an absurd thought.' with an English tokenizer.

Even if the words look somewhat similar, the tokenization is not. This makes it hard for the model to learn a correct representation for each token, since every token gets its own embedding. When tokens are only a few letters long, their embeddings can hardly carry any meaningful representation, so everything relies on the attention layers above to solve this. This wastes precious computational power on recomposing meaning from long sequences of small, near-meaningless tokens.

Compare this to the Dutch tokenizer, where words and meaningful parts of words are tokenized together. This way, the model can learn a decent representation for each token, and the attention layers can focus on the actual meaning of the sentence:

Tokenization of the Dutch sentence 'No, I am not a giraffe! That is an absurd thought.' with a Dutch tokenizer.
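
If you want to reproduce this comparison yourself, a minimal sketch using Hugging Face tokenizers is shown below. The Dutch model id and the Dutch rendering of the sentence are our assumptions and may differ from the exact setup behind the figures.

```python
# Minimal sketch: compare how an English-centric tokenizer and a Dutch tokenizer
# split the example sentence. Model ids are assumptions; swap in the checkpoints
# you actually want to inspect.
from transformers import AutoTokenizer

# A Dutch rendering of the example sentence (back-translated here).
sentence = "Nee, ik ben geen giraf! Dat is een absurde gedachte."

english_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
dutch_tok = AutoTokenizer.from_pretrained("Tweeties/tweety-7b-dutch-v24a")

print("English tokenizer:", english_tok.tokenize(sentence))
print("Dutch tokenizer:  ", dutch_tok.tokenize(sentence))
```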

Trans-Tokenization

Our Trans-Tokenization method is designed to bridge this gap between high-resource and low-resource languages. The method leverages the work done on existing large language models pre-trained in high-resource languages, such as English, and adapts them for use with languages that have fewer resources. This is achieved through a cross-lingual embedding initialization, where the token embeddings of the target language are initialized using a weighted average of semantically similar token embeddings from the source language. Concretely, we follow these steps:

  1. Token Alignment: The first step involves aligning tokens from the source language (e.g., English) with tokens from the target language (e.g., Dutch). This alignment is derived from statistical word alignment over parallel corpora, ensuring that the tokens mapped across languages carry similar meanings and contextual uses. For this, we use a slightly modified version of FastAlign to obtain token-level alignments.
  2. Embedding Mapping: Once tokens are aligned, the embeddings of the target-language tokens are initialized. This initialization is not random but based on the embeddings of their aligned counterparts in the source language. For the details we refer to the paper, but the idea is that we use the FastAlign counts to weight the embeddings of the source-language tokens (see the sketch after this list). This way, the new embeddings capture the full meaning of each token.
  3. Model Adaptation: The final step involves adapting the newly initialized model by continued pre-training on data in the new language. We found that a limited amount of pre-training (e.g. 40 GPU-hours) was sufficient to achieve competitive performance. This is especially needed when the LM head and the embeddings are not tied, but in both cases the generation quality still increases.
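
To make step 2 concrete, here is a minimal sketch of the embedding mapping under simplifying assumptions: the alignment output is reduced to co-occurrence counts between target and source tokens, and each target embedding is initialized as the count-weighted average of its aligned source embeddings. The paper's exact weighting and smoothing differ; treat this as an illustration, not the released implementation.

```python
# Illustrative sketch of the embedding-mapping step (not the paper's exact recipe):
# each target-language token embedding is initialized as the count-weighted average
# of the source-language embeddings it was aligned to.
import numpy as np

def init_target_embeddings(source_emb, alignment_counts, target_vocab_size):
    """source_emb: (source_vocab, dim) array of source-model token embeddings.
    alignment_counts: dict mapping target_token_id -> {source_token_id: count},
    e.g. aggregated from FastAlign links over a tokenized parallel corpus.
    Target tokens without any alignment fall back to the mean source embedding."""
    target_emb = np.tile(source_emb.mean(axis=0), (target_vocab_size, 1))
    for tgt_id, counts in alignment_counts.items():
        src_ids = np.array(list(counts.keys()))
        weights = np.array(list(counts.values()), dtype=np.float64)
        weights /= weights.sum()                  # counts -> weights
        target_emb[tgt_id] = weights @ source_emb[src_ids]
    return target_emb
```

The resulting matrix would then replace the new model's input embeddings (and, for untied models, the same mapping would be applied to the LM head) before continued pre-training in step 3.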

Available Models

We developed several language-specific models, either for this paper or in collaboration with other groups. Here is a summary of the models and the key results:

  • Dutch: This model shows strong performance in language modeling, with a perplexity of 7.7 on the held-out set of the mC4 corpus and strong scores on SQuAD-NL. Read more about it in the blog post.

  • Armenian: Our Armenian model demonstrates robust performance in text generation and summarization tasks, showcasing its ability to adapt to the unique Armenian script and linguistic structure effectively.

  • Tatar: Our Tatar model, developed for a significantly under-resourced language, achieved state-of-the-art results in machine translation tasks, especially in zero-shot scenarios where no direct training data was available.

  • Italian: The RiTA team adapted Mistral and Llama 3 to Italian. Those models show encouraging results after training on as little as 5 billion Italian tokens. Aside from this, the team also presents an evaluation resource (ItaEval) to evaluate these models! Check out their paper here.

Hydra LLMs

Hydra LLMs are named for their ability to support multiple 'heads', each capable of processing different languages. This architecture allows the models to switch seamlessly between languages, making them highly efficient for tasks that require multilingual capabilities, as we show by outperforming Google Translate without explicit training on translation.

Each language supported by a Hydra LLM has its own dedicated token embeddings and a language-specific tokenizer. Since all tokens share the same embedding space, irrespective of the language, the model can leverage knowledge shared across languages, and it can switch between different language heads depending on the input language. Despite having multiple language heads, Hydra LLMs share the same underlying attention layers, which also allows the model to switch languages between input and output.
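
As a rough illustration of this architecture (our own sketch, not the released code), a Hydra-style model can be thought of as one shared transformer body with per-language embedding tables and LM heads that are selected at run time:

```python
# Illustrative PyTorch sketch of a Hydra-style LLM: a shared transformer body with
# per-language input embeddings and LM heads, selected depending on the language.
import torch.nn as nn

class HydraLM(nn.Module):
    def __init__(self, shared_body, hidden_size, vocab_sizes):
        """shared_body: the shared transformer layers, mapping embedded inputs
        to hidden states. vocab_sizes: per-language vocabulary sizes,
        e.g. {"en": 32000, "tt": 32768} (illustrative numbers)."""
        super().__init__()
        self.body = shared_body
        self.embeddings = nn.ModuleDict(
            {lang: nn.Embedding(size, hidden_size) for lang, size in vocab_sizes.items()}
        )
        self.lm_heads = nn.ModuleDict(
            {lang: nn.Linear(hidden_size, size, bias=False) for lang, size in vocab_sizes.items()}
        )

    def forward(self, input_ids, src_lang, tgt_lang):
        # Embed with the source-language table and decode with the target-language
        # head, e.g. src_lang="en", tgt_lang="tt" for English-to-Tatar generation.
        hidden = self.body(self.embeddings[src_lang](input_ids))
        return self.lm_heads[tgt_lang](hidden)
```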

In our experiments, we demonstrated the practical applications and benefits of Hydra LLMs with zero-shot translation, where they achieved state-of-the-art performance. For example, the model trained with English and Tatar not only excelled in direct translation but also showed a remarkable ability to translate between languages that were never paired during training, such as Russian to Tatar.

Zero-shot translation results for Hydra LLMs. Some knowledge of Tatar is useful, but a reference translation is provided at the top.

Conclusion

By leveraging existing high-resource language models, our Trans-Tokenization approach enables the adaptation of these high-resource models to new languages with significantly reduced computational costs and time, all while maintaining a language-specific tokenizer. We showed that this method can be used to create high-quality language models for low-resource languages, such as Dutch, Armenian, and Tatar, with minimal pre-training data. The resulting models demonstrated competitive performance across various NLP tasks.

Our development of Hydra LLMs further exemplifies the practical application of this methodology, enabling a single model framework to process multiple languages in parallel. This architecture was particularly effective in zero-shot translation tasks, where it managed to outperform existing models despite minimal direct training on the target languages. These results underline the potential of multi-head models in managing different tokenizers within a unified system, offering a path toward more versatile and economically viable NLP solutions.

By publishing these findings, we hope to encourage further research into model adaptation techniques and monolingual model deployment, especially for languages with limited resources.

Acknowledgements

This research has been funded by the Flanders AI Research Programme and FWO. We would like to thank the KU Leuven, the Flemish Supercomputer Center (VSC) and UGent for their support and collaboration.


Linked publications

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP. François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester. First Conference on Language Modeling (COLM), 2024. read paper