Posted on 2025-10-28

NanoGPT-inference

LLM inference from scratch



LLM inference, the engineering behind serving LLMs efficiently and economically, is becoming increasingly important. In this post, I'll show you how to speed up LLM inference with various techniques. I'm also releasing the code for each inference engine as a simple extension of Karpathy's NanoGPT.


I recently gave a course at KU Leuven on LLM inference, which was a lot of fun. The course was aimed specifically at engineers, and I think LLM inference is a great topic for that audience: very technical and engineering-heavy, while still tracking SOTA research. It's also a field I've really come to appreciate over the last few years, especially with colleagues like Piotr writing in-depth blog posts on the topic. There is a lot of depth to serving LLMs efficiently and economically, and understanding the trade-offs involved can be the difference between a profitable service and a loss-making one. With that much depth and so many engineering challenges, it sounded like a great course!

To prepare properly, I decided a simple playground codebase would be helpful. Such a codebase would let me test theoretical results and get some "real" numbers. While I do have experience as an LLM inference engineer from my time at Aleph Alpha, I had never implemented fundamental techniques like KV caching from scratch. I believe building things from scratch is the best way to really understand a topic, so I decided to build a small repository implementing common techniques to speed up LLM inference.

I started from NanoGPT, an easy-to-understand GPT implementation of roughly 300 lines of code with a focus on simplicity and education. More importantly, it has a very rudimentary .generate() that could use some attention. Unlike Hugging Face transformers or vLLM, NanoGPT doesn't hide the model behind layers of abstraction, so we can get straight to work on the core ideas.
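For reference, the baseline generation loop looks roughly like the sketch below: every new token triggers a full forward pass over the entire context, with no KV cache. This is a paraphrased sketch rather than the exact repository code; it assumes a NanoGPT-style model whose forward pass returns (logits, loss) and whose config exposes block_size.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    # Naive autoregressive decoding: no KV cache, no batching tricks.
    # `model` is assumed to be a NanoGPT-style GPT returning (logits, loss),
    # and `idx` a (batch, seq_len) tensor of token ids.
    for _ in range(max_new_tokens):
        # Crop the context to the model's block size if it has grown too long.
        idx_cond = idx if idx.size(1) <= model.config.block_size else idx[:, -model.config.block_size:]
        # Full forward pass over the whole context -- the work grows every step.
        logits, _ = model(idx_cond)
        # Only the logits at the last position matter for predicting the next token.
        logits = logits[:, -1, :] / temperature
        if top_k is not None:
            # Optionally restrict sampling to the top-k most likely tokens.
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        # Append the sampled token and repeat.
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```

Most of the optimizations that follow boil down to removing redundant work from this loop or running more of it in parallel.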

I implemented various LLM inference optimizations, most of them building on the previous ones (unless that got too messy). I also released the code as a GitHub repository, so you can try it out yourself. The end result is a nice speed-up with clear progress along the way: from 580 tokens/sec to over 14k tokens/sec.

Figure: Throughput vs. batch size.

In the following sections, I'll go through the different optimizations. I assume you have a decent understanding of LLMs themselves; if not, I recommend starting with Karpathy's nanoGPT, or his videos if you prefer video over text. The code for these optimizations is based on NanoGPT, so it's a good starting point.

The end goal of this project is not to build a production-ready inference engine, but to provide an educational codebase with different techniques and optimizations and their effects on performance. If you are looking for an engine that is usable in production, I recommend vLLM or sglang. Both implement all of the optimizations I'll cover here and more.

Acknowledgments

This blog post was inspired by my preparations for a course at KU Leuven on LLM inference. I want to thank the students for their engagement and feedback. I also want to thank Miryam de Lhoneux and Luc De Raedt for giving me this opportunity.
