At Pleias, I work on LLM pretraining and synthetic data — most recently releasing Nemotron-Personas-France together with NVIDIA. At KU Leuven, my research spans tokenization, fairness, and Dutch language models.
I created RobBERT, the Dutch language model family in the top 80 on Hugging Face, and have collaborated on most of the open Dutch LLMs, from Tweety to ChocoLlama. On the safety side, I work on measuring bias, reducing toxicity (work with Apple, published at ICML), and questioning who decides what “fair” means. I also work on tokenization and developed trans-tokenization, a method for transferring LLMs from one language to another.
My work has been published at top AI venues such as ICML, NAACL, and EMNLP, including research done at Apple and Aleph Alpha. I've done research visits at MilaNLP (Bocconi), HU Berlin, and the Weizenbaum Institute, and have accumulated 1,000+ citations across 32 publications. My research has been covered by WIRED, MIT Technology Review, De Tijd, and VTM Nieuws.
At KU Leuven, I was a member of the GenAI advisory board and currently sit on the council for research policy. I also contributed as an NLP expert to an EU AI Office workshop on the code of practice for general-purpose AI.
I also frequently give talks on AI and language models. I've spoken at companies including Apple, KBC, VRT, TechWolf, ML6, and Superlinear, as well as at public events such as those at Wintercircus. Topics range from LLM pretraining and technical deep dives into LLM inference to AI safety and introductory overviews of the Dutch NLP ecosystem.