Pieter Delobelle | Dutch Language Models

Abstract

What does it actually take to build a language model for a mid‑resource language like Dutch? A research‑level tour through the design space — from training from scratch to remapping the embeddings of an existing English model — and why tokenization decides a surprising amount of the outcome.

The talk walks through five years of Dutch‑LM work: RobBERT as a BERT‑style baseline, BPE‑Knockout on why subword splits don't follow morphology, trans‑tokenization as a cheap way to port a strong English base model to Dutch, and what ChocoLlama's data curation taught us about Dutch‑language corpora.

Outline

What the talk covers

01 Setting the scene. Autoregressive vs masked language modelling, why BERT‑style models remain remarkably efficient for Dutch NLU, and where RobBERT fits in.
02 The tokenization gap. Fertility (the average number of tokens per word) is 1.09 for English but 1.50 for Dutch on a standard GPT tokenizer; what this costs at every layer of attention, and why RobBERT's tokenizer gets it down to 1.20.
03 BPE‑Knockout and morphology. Why BPE splits land on unintuitive positions for fusional and agglutinative languages — and what gets lost when morphemes get merged.
04 The design space. Training from scratch (GPT‑NL, someday), trans‑tokenization (RobBERT‑2023, Tweety‑7B‑Dutch), finetuning (GEITje), and prompting in English, ranked by cost and by “chance of generating Franken‑Dutch”.
05 ChocoLlama. Data curation lessons across OSCAR, Open Subtitles, Wikipedia, job descriptions (TechWolf), Staatsblad (Bizzy), Project Gutenberg, and legislation (ML6) — what worked and what we'd do differently.
06 What's next. Better instruction‑tuning data and more domains, the open question on model size, and the need for honest generative benchmarks for Dutch.
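
The fertility figures in item 02 can be made concrete with a minimal sketch. It assumes the common definition of fertility as subword tokens per whitespace-separated word, and it assumes attention work grows with the square of the sequence length, as in standard self-attention. The `chunk_tokenizer` function is a hypothetical stand-in for illustration only; it is not the GPT or RobBERT tokenizer, and it will not reproduce the numbers quoted in the talk.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def chunk_tokenizer(word: str, max_len: int = 6) -> List[str]:
    """Hypothetical stand-in: cut a word into pieces of at most max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def relative_attention_cost(fertility_a: float, fertility_b: float) -> float:
    """Ratio of pairwise attention work for the same text under two fertilities.

    Assumption: cost scales with sequence length squared.
    """
    return (fertility_a / fertility_b) ** 2


if __name__ == "__main__":
    sample = ["de kat zit op de mat", "ziektekostenverzekering is een lang woord"]
    print("toy fertility:", fertility(sample, chunk_tokenizer))
    # Figures quoted in the outline: 1.09 (English), 1.50 (Dutch), 1.20 (RobBERT).
    print("Dutch vs English:", relative_attention_cost(1.50, 1.09))
    print("RobBERT vs English:", relative_attention_cost(1.20, 1.09))
```

Under these assumptions, going from a fertility of 1.09 to 1.50 means roughly 38% more tokens for the same number of words, and the squared term makes the pairwise attention work grow faster than the token count itself.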

The deck