Serving large language models efficiently is highly complex and full of trade-offs. Most inference providers aim for acceptable generation speeds (measured in tokens/sec) while still utilizing the GPUs as much as possible. The trade-off is that larger batch sizes improve utilization (more FLOP/s and a higher total $\frac{\text{tokens}}{\text{sec}}$ across the batch) at the cost of a worse experience per request (lower $\frac{\text{tokens}}{\text{sec}}$ per request). There are many pitfalls here, so definitely give my old colleague's LLM inference blog post a read.
For some use cases like Deep Research, data generation, or scientific evaluations, we don't care that much about the per-request generation speed or the Time To First Token (TTFT). We want to optimize for total throughput and utilize our hardware as much as possible. This is quite a unique setting and the opposite of interactive use cases, like ChatGPT, where we have to serve at a certain tokens/sec for a pleasant user experience. Similarly, processing the input tokens, called prefill, comes with the same trade-offs, which affect the TTFT.
So if you want to run — say — one million inference jobs, you still need a lot of logic around the existing engines. vLLM itself is good at keeping GPUs utilized with its BatchScheduler, prefill/decode interleaving, etc. However, you cannot submit all those jobs in one go. So I found myself implementing batching, checkpointing, and restart logic quite a few times, naturally in a suboptimal way. The usual flow when using vLLM's OpenAI-compatible API is to read in a dataset, usually from Hugging Face or a local file with one sample per line, and create a request to do something with each sample. When the LLM inference is done, the generated sequence just needs to be saved. Especially when some tasks in a batch take longer, we don't want to wait until every request is processed, since most of the GPUs would be idling, as illustrated in Figure 1. Again, implementing this correctly quickly becomes non-trivial.
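To make the pain concrete, here is roughly what that hand-rolled flow looks like. This is only a minimal sketch, assuming a local vllm serve instance exposing the OpenAI-compatible API; the file names, model, and prompt are placeholders:

```python
import asyncio
import json

from openai import AsyncOpenAI

# Assumes a vLLM server is running locally; endpoint and model are placeholders.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def process(sample: dict) -> dict:
    response = await client.chat.completions.create(
        model="Unbabel/Tower-Plus-9B",
        messages=[{"role": "user", "content": f"Translate to German:\n{sample['text']}"}],
    )
    return {"id": sample["id"], "output": response.choices[0].message.content}


async def main():
    with open("dataset.jsonl") as f:
        samples = [json.loads(line) for line in f]

    # Fire everything at once and write each result as soon as it finishes,
    # so one slow request does not block the rest. No checkpointing, no
    # restarts, no backpressure: exactly the logic you end up reimplementing.
    tasks = [asyncio.create_task(process(s)) for s in samples]
    with open("results.jsonl", "w") as out:
        for finished in asyncio.as_completed(tasks):
            out.write(json.dumps(await finished) + "\n")


asyncio.run(main())
```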

So naturally, I overengineered a solution and created a Python package that can handle this logic: llmq.

Architecturally, llmq is quite simple. It relies on RabbitMQ for the actual queuing logic and mostly offers convenience features to (i) start workers, (ii) submit jobs, and (iii) handle reliability issues. Since I used RabbitMQ quite a lot for my master's thesis in the distant past, I knew this would not be a bottleneck. It's a really well-engineered message broker that can handle hundreds of thousands of messages per second, and the queue length is mostly only limited by the host's memory (+swap!). Combined with pre-fetching, acks, and routing (useful for pipelines), RabbitMQ is a good fit.
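To make the prefetching-and-acks point concrete, here is the general consumer pattern this builds on, using pika. This is a sketch of the RabbitMQ pattern, not llmq's actual worker code; queue names and the handle function are made up:

```python
import pika

# Placeholder connection and queue names, not llmq's internals.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="translation-queue", durable=True)
channel.queue_declare(queue="results-queue", durable=True)

# Prefetch: never hand this worker more than 32 unacknowledged jobs at a time.
channel.basic_qos(prefetch_count=32)


def handle(body: bytes) -> bytes:
    # Placeholder for the real work, e.g. running the LLM on the job payload.
    return body.upper()


def on_job(ch, method, properties, body):
    result = handle(body)
    # Publish the result back to the broker so the client can pick it up.
    ch.basic_publish(exchange="", routing_key="results-queue", body=result)
    # Ack only after the work is done; if the worker crashes before this,
    # the broker redelivers the job to another worker.
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="translation-queue", on_message_callback=on_job)
channel.start_consuming()
```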

The main command is llmq submit <some-queue> jobs.jsonl, which makes a job out of every line and submits them to the broker. There, a worker can collect each job and process it. The result goes back to the broker and gets streamed to the client that ran llmq submit. It's queues all the way down, so if something goes wrong, your data and results should be fine (disclaimer: I am still working on the reliability features, so you'll need to do some manual tasks for now).
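To give an idea of how such a jobs file comes together, the sketch below dumps a Hugging Face dataset into jobs.jsonl, one job per line. The id/messages schema here is an assumption for illustration only; check the repository for the exact fields llmq expects per job:

```python
import json

from datasets import load_dataset

# Illustrative source dataset; swap in whatever you want to process.
dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

with open("jobs.jsonl", "w") as f:
    for i, sample in enumerate(dataset):
        if i >= 1_000_000:  # cap at one million jobs
            break
        # Assumed per-job schema: an id plus chat-style messages.
        job = {
            "id": i,
            "messages": [
                {"role": "user", "content": f"Translate to German:\n{sample['text']}"}
            ],
        }
        f.write(json.dumps(job) + "\n")
```

You would then point llmq submit translation-queue jobs.jsonl at that file.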
From the worker's side, I tried to keep a parallel with the vllm serve commands, but with an added queue name: llmq worker run Unbabel/Tower-Plus-9B translation-queue. I mostly chose vLLM as the inference engine because, again, I am more familiar with it than with others. It's really solid and can handle large batches well, exactly what we need for this project. In the future, I also plan to add some more worker types, like SGLang and deduplication, so that llmq could really be the one library you need for all your synthetic data generation needs.
For now, my library is in quite active development, so I'm playing around a lot with different API variants, mostly for pipelines. However, for simple jobs (i.e., just one stage), llmq already works quite well, and I've used it to create the following publicly available datasets:
- Using Tower-Plus-72B, I created the 🤗 fineweb-edu-german-mt dataset to have some high-quality German pre-training data
- Similarly, the 🤗 fineweb-edu-dutch-mt dataset was generated using Tower-Plus-9B for Dutch
- For instruction-tuning, I also created 🤗 nemotron-dutch-mt using Tower-Plus-9B
So while the library is already useful now, there’s still quite a lot on the roadmap:
- Pipelines with multiple processing stages (work in progress in #9)
- Worker types for data cleaning
- Async fetch and submit
- Auto-config of vLLM’s arguments
- Easier message formatting
- A Python API
Check out the GitHub repository to get started with batch inference, and follow me on Twitter/X @pieterdelobelle for updates on this and other LLM projects!