<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"><channel><title>Research blog of Pieter Delobelle</title><link>https://pieter.ai//test.atom</link><description>Research blog of Pieter Delobelle on pretraining, tokenization &amp; safety.</description><atom:link href="https://pieter.ai//test.atom" rel="self"/><docs>http://www.rssboard.org/rss-specification</docs><generator>python-feedgen</generator><image><url></url><title>Research blog of Pieter Delobelle</title><link>https://pieter.ai//test.atom</link></image><language>en</language><lastBuildDate>Wed, 06 May 2026 07:36:18 +0000</lastBuildDate><item><title>RobBERT</title><link>https://pieter.ai/robbert/</link><description>RobBERT is the state-of-the-art Dutch BERT model. It is a large pre-trained general Dutch language model that can be fine-tuned on a given dataset to perform any text classification, regression or token-tagging task. As such, it has been successfully used by many researchers and practitioners to achieve state-of-the-art performance on a wide range of Dutch natural language processing tasks.</description><guid isPermaLink="false">https://pieter.ai/robbert/</guid><pubDate>Mon, 20 Jan 2020 00:00:00 +0000</pubDate></item><item><title>Tweety-7b-dutch</title><link>https://pieter.ai/tweety-7b-dutch/</link><description>Most Dutch generative language models start from an English or multilingual model and fine-tune it, which works well but is not optimal because the tokens are mostly English. We present Tweety-7b-dutch, a Dutch generative language model that is &lt;i&gt;trans-tokenized&lt;/i&gt; to use Dutch tokens instead of English.
To highlight the benefits of our method, we show that this model outperforms the multilingual and state-of-the-art Dutch generative language models.</description><guid isPermaLink="false">https://pieter.ai/tweety-7b-dutch/</guid><pubDate>Mon, 13 May 2024 00:00:00 +0000</pubDate></item><item><title>Trans-Tokenization</title><link>https://pieter.ai/trans-tokenization/</link><description>We present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to adapt a high-resource monolingual LLM to a new target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language. For this, we leverage a translation resource covering both the source and target languages. We validate our method with the Tweeties, a series of trans-tokenized LLMs, and demonstrate their competitive performance on various downstream tasks across a small but diverse set of languages.</description><guid isPermaLink="false">https://pieter.ai/trans-tokenization/</guid><pubDate>Wed, 10 Jul 2024 00:00:00 +0000</pubDate></item><item><title>NanoGPT-inference</title><link>https://pieter.ai/blog/2025/nanogpt-inference/</link><description>LLM inference, the engineering behind serving LLMs efficiently and economically, is becoming increasingly important. In this post, I'll show you how to speed up LLM inference with various techniques. I also release the code of each inference engine as a simple extension to Karpathy's NanoGPT.</description><guid isPermaLink="false">https://pieter.ai/blog/2025/nanogpt-inference/</guid><pubDate>Tue, 28 Oct 2025 00:00:00 +0000</pubDate></item></channel></rss>