Pieter Delobelle | Nederlandse taalmodellen: noodzaak of luxe? (Dutch language models: necessity or luxury?)

Abstract

GPT‑5 speaks fluent Dutch. So do we still need Dutch models? A technical look at where multilingual models break down, and why building for Dutch remains worth the effort.

The talk is organised around three questions: language and culture (why tokenization and data ratios structurally favour English), sovereign AI (what happens to your prompts in the cloud, and what running models locally realistically costs), and our values (how the choice of training data determines what a model conveys, illustrated by DeepSeek versus Western models and by the work on GEITje and ChocoLlama).

Intended for a technical audience that is comfortable with transformers. No Dutch required.

Outline

What the talk covers

01 Is ChatGPT not good enough? The three‑frame question that organises the talk: Language & culture, Sovereign AI, and Our values.
02 Language & culture. Why English‑first tokenizers need roughly 50% more tokens per word for Dutch text, what that does to attention layers, and how RobBERT's tokenizer brings fertility (tokens per word) down to 1.20.
03 Sovereign AI. Where your ChatGPT conversations actually live, the reach of US authorities, and why running models locally is increasingly viable, even if frontier deployment still costs ~$12M in H100s.
04 Our values. DeepSeek as a great model that nevertheless reflects its creators' values (Buyl et al., 2024); the GEITje takedown, ChocoLlama's data curation, and where GPT‑NL fits in.
05 So what do we need? More Dutch datasets, more permissively‑licensed data, and more serious research into small language models.
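
The fertility figure mentioned in item 02 can be made concrete with a small sketch. It assumes fertility means the average number of tokens per whitespace-separated word. The greedy tokenizer and both vocabularies below are hypothetical toys chosen for illustration; they are not RobBERT's tokenizer or any real model's vocabulary, and the numbers they produce say nothing about real tokenizers.

```python
from typing import Callable, Iterable, List, Set


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Toy tokenizer: longest vocabulary match per word, else single characters."""
    tokens: List[str] = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest candidate first; a single character always matches.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens


# Hypothetical vocabularies: one lacks Dutch word pieces, one includes them.
english_first_vocab = {"de", "taal", "model"}
dutch_aware_vocab = {"de", "taal", "modellen"}

sample = ["de taalmodellen"]
f_english = fertility(sample, lambda t: greedy_tokenize(t, english_first_vocab))
f_dutch = fertility(sample, lambda t: greedy_tokenize(t, dutch_aware_vocab))
print(f_english, f_dutch)
```

In this toy setup the vocabulary without the Dutch piece "modellen" falls back to single characters for the word ending, so the same sentence needs more tokens. Longer token sequences in turn mean more positions for the attention layers to process, which is the cost the outline item points at.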
