Pieter Delobelle

LLM inference, the engineering behind serving LLMs efficiently and economically, is becoming increasingly important. In this post, I'll show you how to speed up LLM inference with various techniques. I also release the code of each inference engine as a simple extension to Karpathy's NanoGPT.

blogpost Posted on January 21, 2025

Deploying DeepSeek R1 'locally'

A Practical Guide with Ray and vLLM

This guide demonstrates how to deploy DeepSeek R1 671B in FP8 across multiple H100 nodes using Ray and vLLM.

blogpost Posted on September 11, 2024

European Tweeties

Creating language models for all 24 EU languages

Existing language models are either focused on English or multilingual, with reasonable performance in the bigger languages. However, many languages are not supported at all, or the models perform a lot worse in them than when prompted in English. To address this, we are creating 24 new language models, one for each EU language.

blogpost Posted on August 30, 2024

Setting up a decent SentencePiece tokenizer

Reasonable monolingual tokenization from noisy data

Tokens are what make language models understand language. Each sentence is split into tokens, which are then converted to embeddings. So we want good tokens that cover as many of a language's words as possible within our limited vocabulary. Luckily, there are many libraries for this, like SentencePiece. However, configuring them to get decent results on noisy data is not trivial.

blogpost Posted on January 05, 2024

Evaluating Dutch LLMs

SQUAD-NL

blogpost Posted on December 27, 2023

Dutch Chat Toolkit

Creating retrieval-augmented chatbots

A lot of NLP technologies are easy to use for beginners, but creating and deploying a chatbot is still a bit tricky. Let's make a Python CLI toolkit to quickly create a chatbot with a web-based user interface.

blogpost Posted on November 04, 2023

Building a language learning app

Day 3: Prompting and basic UI

With the current state of transformer models for text and speech, I believe that there is an opportunity to make fully immersive language learning apps that can tailor their content to what the user wants to learn. In this series, I try to work out a demo using different NLP technologies.

blogpost Posted on November 01, 2023

Blog

NanoGPT-inference - Tensor Parallelism

Scaling inference across multiple GPUs

NanoGPT-inference - Baseline

Introduction to LLM inference and the roofline model

NanoGPT-inference - Sampling

Fused kernels with rejection sampling

NanoGPT-inference - KV Caching

Key-value caching for efficient inference

NanoGPT-inference - CUDA Graphs

Optimize GPU execution with CUDA graphs

NanoGPT-inference

LLM inference from scratch

Deploying DeepSeek R1 'locally'

A Practical Guide with Ray and vLLM

European Tweeties

Creating language models for all 24 EU languages

Setting up a decent SentencePiece tokenizer

Reasonable monolingual tokenization from noisy data

Evaluating Dutch LLMs

SQUAD-NL

Dutch Chat Toolkit

Creating retrieval-augmented chatbots

Building a language learning app

Day 3: Prompting and basic UI

Building a language learning app

Day 2: setting up the app

Building a language learning app

Day 1: planning

Migrating from HuggingFace AdamW

Drop-in replacement optimizer with learning schedule

Updating RobBERT (part 2)

Bringing a language model to 2022

Updating RobBERT

Bringing a language model to 2022