Posted on 2025-10-28

NanoGPT-inference - Baseline

Introduction to LLM inference and the roofline model

Since we start from NanoGPT, we'll use its .generate() function as the baseline. This function is pretty simple: we do a forward pass of the model (self(idx_cond)), sample the next token with temperature and top_k, and continue until we reach the maximum length:

Screenshot of the NanoGPT .generate() function.
The NanoGPT .generate() function, credits to Andrej Karpathy.
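
Paraphrased from the screenshot above (the repository has the exact source), the loop looks roughly like this:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    # idx: (batch, seq_len) tensor of token ids
    for _ in range(max_new_tokens):
        # crop the context to the model's block size
        idx_cond = idx if idx.size(1) <= model.config.block_size else idx[:, -model.config.block_size:]
        # full forward pass over the whole context -- this is where the time goes
        logits, _ = model(idx_cond)
        # take the logits at the last position and apply the temperature
        logits = logits[:, -1, :] / temperature
        # optionally keep only the top_k most likely tokens
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('inf')
        # sample the next token and append it to the sequence
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```

Note that every iteration feeds the entire (growing) sequence through the model again; keep that in mind for the KV cache discussion later.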

The heavy lifting is done in the self(idx_cond) call, the forward pass of the model, which is where the actual computation happens. I encourage you to take a look at the code; it's not too complex and a good way to get familiar with the model.

In these blog posts, I'll mostly focus on a model that you can load with NanoGPT, namely the biggest one in the GPT-2 series: gpt2-xl with 1.5B parameters. This is quite small by today's standards, but at least we can verify that we generate correct tokens.

Sadly, generating tokens with the .generate() function is not very efficient (otherwise I wouldn't have written this blog post). Let's look at how it performs as a function of the number of requests we send to the model in parallel:

Baseline inference performance.

As we can see, the throughput is not very good (measured on an L40S). It starts around 100 tokens per second at a batch size of 1. That's not too bad, but the per-request throughput drops significantly when we send more requests in parallel (i.e. increase the batch size). With 32 concurrent requests, the throughput per request drops to around 15 tokens per second. Note that this chart is per request, so in total we generate $32 \cdot 15 = 480$ tokens per second, definitely not a 32x increase! And if you want to serve a chatbot, you'll need to rent one of these GPUs for every 4-8 users. Not a great business model.
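
If you want to reproduce a chart like this, a minimal sketch of the measurement could look as follows, assuming model is a NanoGPT GPT instance loaded on the GPU (this is not necessarily the exact harness I used; the prompt length and token counts are placeholders):

```python
import time
import torch

def measure_throughput(model, batch_size, prompt_len=128, new_tokens=200, device="cuda"):
    """Time batched .generate() calls and report per-request and total tokens/s."""
    # random prompt tokens are good enough for a throughput measurement
    idx = torch.randint(0, model.config.vocab_size, (batch_size, prompt_len), device=device)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(idx, max_new_tokens=new_tokens, temperature=1.0, top_k=50)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    per_request_tps = new_tokens / elapsed           # what a single user experiences
    total_tps = batch_size * new_tokens / elapsed    # what the GPU delivers in total
    return per_request_tps, total_tps
```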

Since GPUs are generally quite expensive, we want to use them as efficiently as possible. Luckily, there are quite a few ways to improve the performance of the .generate() function, and we'll look at some of them in the next posts.

In the remainder of this post, I'll focus on the different metrics that we can use to evaluate our performance, as well as a brief intro to the roofline model.

Metrics

When serving LLMs, we usually care about the time it takes for our language model to start responding (time to first token) and the time it takes in between tokens (inter-token latency). In summary, we care about:

  • Time to first token (ms): the time it takes for the model to start responding. This is usually the first forward pass; once we get to KV caching, we'll also call this first pass the 'prefill'. Lower is better, and something below 100ms is usually considered good.
  • Inter-token latency (ms): the time it takes for the model to produce each subsequent token. Lower is better again, and something below 20ms is usually considered good.
  • Throughput per request (tokens/s): The inverse of the inter-token latency. Higher is better, and something above 50 tokens/s is usually considered good.
  • Total throughput (tokens/s): The total number of tokens we can process per second. This is the product of the throughput per request and the batch size (see the sketch below this list).
  • Memory usage (GB): The GPU memory usage of our inference engine. A model needs to fit into the GPU memory, as well as activations, some intermediate tensors, and later on some cached values. The higher the memory usage, the bigger our VRAM needs to be. For reference, an NVIDIA H100 has 80GB of VRAM and a consumer GPU like the NVIDIA 4090 has 24GB.
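
To make the relationships between these metrics concrete, here is a small sketch that derives them from per-token arrival times (the timestamps are a hypothetical input; the timing hooks themselves are up to you):

```python
def summarize_latency(request_start: float, token_times: list[float], batch_size: int):
    """Turn per-token arrival times (in seconds) into the metrics above."""
    ttft_ms = (token_times[0] - request_start) * 1000             # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl_ms = 1000 * sum(gaps) / len(gaps)                         # mean inter-token latency
    per_request_tps = 1000 / itl_ms                               # inverse of the ITL
    total_tps = per_request_tps * batch_size                      # assuming identical concurrent requests
    return ttft_ms, itl_ms, per_request_tps, total_tps

# Example: 20 ms between tokens -> 50 tokens/s per request -> 1600 tokens/s at batch size 32.
```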

The target values for these metrics depend on the use case. A common setup is a chatbot, where we want a low time to first token and an acceptable throughput per request. For offline data generation, we care less about either of these, but we want to maximize the total throughput.

Roofline model

Current GPUs are quite fast, especially thanks to tensor cores: dedicated circuitry for matrix multiplications (the core operation in LLMs) in full, half and various lower-bit precisions. In order to maximize the performance of our inference engine, we need to make sure that we are using these tensor cores as much as possible.

Specification of the NVIDIA H100 GPU.
Spec sheet of the NVIDIA H100 GPU, credits to NVIDIA.

However, a quick glance at a modern GPU's spec sheet shows that the theoretical peak performance (let's go with half precision for now) is around 2000 TFLOPS (1979, to be precise), while the memory bandwidth is around 3.35 TB/s. So for every byte we load, we have to do $\frac{1979}{3.35} \approx 590$ floating point operations of work to keep the compute units busy.

For current language models, we don't have that much work to do per byte (either take my word for it or take a look at this blog post from my old colleagues Piotr and Felix). So we need a way to increase the amount of work we do per byte loaded: processing more than one request in parallel. This way we load each byte of the model once, but use it for 2, 4 or even a few thousand requests, until we reach the compute limit.
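
To get a feel for how batching changes the picture, here is a back-of-the-envelope sketch. My assumptions: fp16 weights (2 bytes per parameter), roughly 2 FLOPs per parameter per generated token per request, and I ignore attention over past tokens and activation traffic, so treat the numbers as optimistic upper bounds rather than predictions:

```python
PEAK_FLOPS = 1979e12   # H100 half-precision peak from the spec sheet, FLOP/s
BANDWIDTH = 3.35e12    # H100 memory bandwidth, bytes/s

def decode_tokens_per_second(n_params: float, batch_size: int) -> float:
    flops_per_step = 2 * n_params * batch_size    # one new token for every request
    bytes_per_step = 2 * n_params                 # the weights are read once per step
    intensity = flops_per_step / bytes_per_step   # FLOPs per byte, equals batch_size here
    # roofline: we are capped by either memory traffic or raw compute
    attainable_flops = min(PEAK_FLOPS, BANDWIDTH * intensity)
    step_time = flops_per_step / attainable_flops
    return batch_size / step_time                 # total tokens/s across the batch

for b in (1, 8, 64, 512, 1024):
    print(b, round(decode_tokens_per_second(1.5e9, b)))
```

With these assumptions the crossover from memory-bound to compute-bound happens around a batch size of 590, exactly the FLOPs-per-byte ratio we computed above.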

This behaviour is captured by the roofline model: either we are bottlenecked by the memory bandwidth (memory-bound) or by the compute capability (compute-bound). For a gentle deep dive into the roofline model, check out this video from my old colleague Szymon.
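
In equation form, with $I$ the arithmetic intensity (FLOPs per byte loaded), the attainable performance is

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\; BW \cdot I\right)$$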

Roofline model.
Roofline model, by Giu.natale - Own work, CC BY-SA 4.0, Link.

The funny thing is that we seem to be compute-bound with our naive baseline: as we double the batch size, our throughput per request drops quite quickly because we have to wait for the GPU to finish the work. This is mostly because we do not have a KV cache, so at every step we redo a lot of the computation from the previous step. KV caching is therefore the first optimization we'll look at, in the next post.
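
As a rough count of the matrix-multiply work (ignoring attention over past tokens), with $N$ model parameters and $T$ generated tokens: without a cache, step $t$ re-runs the model over all $t$ tokens, while with a KV cache each step only processes the newly generated token:

$$\text{FLOPs}_{\text{no cache}} \approx \sum_{t=1}^{T} 2Nt = N\,T(T+1) \qquad \text{vs.} \qquad \text{FLOPs}_{\text{cache}} \approx 2NT$$

For a few hundred generated tokens that is roughly two orders of magnitude more matmul work than necessary, which is why the baseline hits the compute roof so early.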
