Serving large language models efficiently is highly complex and full of trade-offs. Most inference providers aim for acceptable generation speeds (measured in tokens/sec) while still utilizing the GPUs as much as possible. The trade-off is that larger batch sizes improve utilization (more FLOP/s and a higher total $\frac{\text{tokens}}{\text{sec}}$ across the batch) at the cost of a worse experience per request (lower $\frac{\text{tokens}}{\text{sec}}$ per request). There are many pitfalls here, so definitely give my old colleague's LLM inference blog post a read.
For some use cases like Deep Research, data generation, or scientific evaluations, we don't care that much about the per-request generation speed or the Time To First Token (TTFT). We want to optimize for total throughput and utilize our hardware as much as possible. This is quite a unique setting and the opposite of interactive use cases, like ChatGPT, where we have to serve at a certain tokens/sec for a pleasant user experience. Similarly, processing the input tokens, called prefill, comes with the same trade-offs, which affect the TTFT.
So if you want to run — say — one million inference jobs, you still need a lot of logic around the existing engines. vLLM itself is good at keeping GPUs utilized with its BatchScheduler, prefill/decode interleaving, etc. However, you cannot submit all those jobs in one go. So I found myself implementing batching, checkpointing, and restart logic quite a few times, naturally in a suboptimal way. The usual flow when using vLLM's OpenAI-compatible API is to read in a dataset, usually from Hugging Face or a local file with one sample per line, and create a request to do something with each sample. When the LLM inference is done, the generated sequence just needs to be saved. Especially when some tasks in a batch take longer, we don't want to wait until every request is processed, since most of the GPUs would be idling, as illustrated in Figure 1. Again, implementing this correctly quickly becomes non-trivial.
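To make the pain concrete, here is roughly what that hand-rolled flow looks like. This is only a minimal sketch, assuming a local vllm serve instance exposing the OpenAI-compatible API; the file names, model, and prompt are placeholders:

```python
import asyncio
import json

from openai import AsyncOpenAI

# Assumes a vLLM server is running locally; endpoint and model are placeholders.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def process(sample: dict) -> dict:
    response = await client.chat.completions.create(
        model="Unbabel/Tower-Plus-9B",
        messages=[{"role": "user", "content": f"Translate to German:\n{sample['text']}"}],
    )
    return {"id": sample["id"], "output": response.choices[0].message.content}


async def main():
    with open("dataset.jsonl") as f:
        samples = [json.loads(line) for line in f]

    # Fire everything at once and write each result as soon as it finishes,
    # so one slow request does not block the rest. No checkpointing, no
    # restarts, no backpressure: exactly the logic you end up reimplementing.
    tasks = [asyncio.create_task(process(s)) for s in samples]
    with open("results.jsonl", "w") as out:
        for finished in asyncio.as_completed(tasks):
            out.write(json.dumps(await finished) + "\n")


asyncio.run(main())
```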

So naturally, I overengineered a solution and created a Python package that can handle this logic: llmq.

Architecturally, llmq is quite simple. It relies on RabbitMQ for the actual queuing logic and mostly offers convenience features to (i) start workers, (ii) submit jobs, and (iii) handle reliability issues. Since I used RabbitMQ quite a lot for my master's thesis in the distant past, I knew this would not be a bottleneck. It's a really well-engineered message broker that can handle hundreds of thousands of messages per second, and the queue length is mostly only limited by the host's memory (+swap!). Combined with pre-fetching, acks, and routing (useful for pipelines), RabbitMQ is a good fit.
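To make the prefetching-and-acks point concrete, here is the general consumer pattern this builds on, using pika. This is a sketch of the RabbitMQ pattern, not llmq's actual worker code; queue names and the handle function are made up:

```python
import pika

# Placeholder connection and queue names, not llmq's internals.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="translation-queue", durable=True)
channel.queue_declare(queue="results-queue", durable=True)

# Prefetch: never hand this worker more than 32 unacknowledged jobs at a time.
channel.basic_qos(prefetch_count=32)


def handle(body: bytes) -> bytes:
    # Placeholder for the real work, e.g. running the LLM on the job payload.
    return body.upper()


def on_job(ch, method, properties, body):
    result = handle(body)
    # Publish the result back to the broker so the client can pick it up.
    ch.basic_publish(exchange="", routing_key="results-queue", body=result)
    # Ack only after the work is done; if the worker crashes before this,
    # the broker redelivers the job to another worker.
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="translation-queue", on_message_callback=on_job)
channel.start_consuming()
```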

The main command is llmq submit <some-queue> jobs.jsonl, which makes a job out of every line and submits them to the broker. There, a worker can collect each job and process it. The result goes back to the broker and gets streamed to the client that ran llmq submit. It's queues all the way down, so if something goes wrong, your data and results should be fine (disclaimer: I am still working on the reliability features, so you'll need to do some manual tasks for now).
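To give an idea of how such a jobs file comes together, the sketch below dumps a Hugging Face dataset into jobs.jsonl, one job per line. The id/messages schema here is an assumption for illustration only; check the repository for the exact fields llmq expects per job:

```python
import json

from datasets import load_dataset

# Illustrative source dataset; swap in whatever you want to process.
dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

with open("jobs.jsonl", "w") as f:
    for i, sample in enumerate(dataset):
        if i >= 1_000_000:  # cap at one million jobs
            break
        # Assumed per-job schema: an id plus chat-style messages.
        job = {
            "id": i,
            "messages": [
                {"role": "user", "content": f"Translate to German:\n{sample['text']}"}
            ],
        }
        f.write(json.dumps(job) + "\n")
```

You would then point llmq submit translation-queue jobs.jsonl at that file.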
From the worker's side, I tried to keep a parallel with the vllm serve commands, but with an added queue name: llmq worker run Unbabel/Tower-Plus-9B translation-queue. I mostly chose vLLM as the inference engine because, again, I am more familiar with it than with others. It's really solid and can handle large batches well, exactly what we need for this project. In the future, I also plan to add some more worker types, like SGLang and deduplication, so that llmq could really be the one library you need for all your synthetic data generation needs.
For now, my library is in quite active development, so I'm playing around a lot with different API variants, mostly for pipelines. However, for simple jobs (i.e., just one stage), llmq already works quite well, and I've used it to create the following publicly available datasets:
- Using Tower-Plus-72B, I created the 🤗 fineweb-edu-german-mt dataset to have some high-quality German pre-training data
- Similarly, the 🤗 fineweb-edu-dutch-mt dataset was generated using Tower-Plus-9B for Dutch
- For instruction-tuning, I also created 🤗 nemotron-dutch-mt using Tower-Plus-9B
So while the library is already useful now, there’s still quite a lot on the roadmap:
- Pipelines with multiple processing stages (work in progress in #9)
- Worker types for data cleaning
- Async fetch and submit
- Auto-config of vLLM’s arguments
- Easier message formatting
- A Python API
Check out the GitHub repository to get started with batch inference, and follow me on Twitter/X @pieterdelobelle for updates on this and other LLM projects!