Challenges of training LLMs for mid-resource languages
Abstract
What does it actually take to build a language model for a mid‑resource language like Dutch? This talk is a research‑level tour through the design space, from training from scratch to remapping the embeddings of an existing English model, and it explains why tokenization decides a surprising amount of the outcome.
The talk walks through five years of Dutch‑LM work: RobBERT as a BERT‑style baseline, BPE‑Knockout on why subword splits don't follow morphology, trans‑tokenization as a cheap way to port a strong English base model to Dutch, and what ChocoLlama's data curation taught us about Dutch‑language corpora.
Outline