Blog

Less polished, a bit more opinionated and more emojis, what more do you want from a blog? 🙃
Posted on September 11, 2024

European Tweeties

Creating language models for all 24 EU languages

Existing language models are either focused on English or multilingual, with reasonable performance in the bigger languages. However, many languages are not supported at all, or models perform a lot worse in them than when prompted in English. To address this, we are creating 23 new language models for all EU languages.

Posted on August 30, 2024

Setting up a decent SentencePiece tokenizer

Reasonable monolingual tokenization from noisy data

Tokens are what make language models understand language. Each sentence gets split into tokens, which are then converted to embeddings. So we want good tokens that cover as many of a language's words as possible within our limited vocabulary. Luckily, there are many libraries for this, like SentencePiece. However, configuring them to get decent results on noisy data is not trivial.

Posted on December 27, 2023

Dutch Chat Toolkit

Creating retrieval-augmented chatbots

A lot of NLP technologies are easy to use for beginners, but creating and deploying a chatbot is still a bit tricky. Let's make a Python CLI toolkit to quickly create a chatbot with a web-based user interface.

Posted on November 04, 2023

Building a language learning app

Day 3: prompting and basic UI

With the current state of transformer models for text and speech, I believe that there is an opportunity to make fully immersive language learning apps that can tailor their content to what the user wants to learn. In this series, I try to work out a demo using different NLP technologies.

Posted on November 01, 2023

Building a language learning app

Day 2: setting up the app

With the current state of transformer models for text and speech, I believe that there is an opportunity to make fully immersive language learning apps that can tailor their content to what the user wants to learn. In this series, I try to work out a demo using different NLP technologies.

Posted on October 30, 2023

Building a language learning app

Day 1: planning

With the current state of transformer models for text and speech, I believe that there is an opportunity to make fully immersive language learning apps that can tailor their content to what the user wants to learn. In this series, I try to work out a demo using different NLP technologies.

Posted on October 24, 2023

Migrating from HuggingFace AdamW

Drop-in replacement optimizer with learning rate schedule

The AdamW implementation from HuggingFace is deprecated and can even lead to errors. This short blog post suggests a drop-in replacement.

Posted on November 10, 2022

Updating RobBERT (part 2)

Bringing a language model to 2022

In this second blog post on updating RobBERT, I discuss the training and analyse how well the model performs on old benchmarks and new tasks.

Posted on August 12, 2022

Updating RobBERT

Bringing a language model to 2022

A few things have happened since our Dutch language model RobBERT was trained in 2019. In this blog post, I explore how to update RobBERT efficiently to include new words and shifts in word usage.