Deploying DeepSeek R1 with Ray and vLLM: A Practical Guide
This guide walks through deploying DeepSeek R1 using Ray and vLLM, focusing on a basic but functional setup using standard ethernet rather than InfiniBand. While I've been using this setup successfully, your mileage may vary depending on your specific infrastructure.
Prerequisites
- At least 2 nodes with H100 GPUs (required for FP8 support)
- ~1TB disk space recommended
- Standard ethernet connections between nodes
- Basic familiarity with conda environments
Node Setup and Dependencies
First, connect to your nodes over SSH. For the node you plan to use as the head, forward ports 8000 and 8265; we'll use them later for curl requests and Ray's cool-looking admin dashboard:
ssh <head-node> -L 8000:localhost:8000 -L 8265:localhost:8265
For the worker node, a standard SSH connection is sufficient:
ssh <worker-node>
On both nodes, install the required dependencies:
Verify the CUDA version reported by the driver (12.2 on our H100 nodes):
nvidia-smi
Install PyTorch with CUDA support:
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
Install additional dependencies:
conda install vllm ray transformers
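Before moving on, it's worth a quick sanity check that PyTorch can actually see the GPUs and was built against the expected CUDA version (this is just a convenience check, not part of the deployment itself):
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda, torch.cuda.device_count())"
This should print True, the CUDA version PyTorch was built with, and the number of GPUs on the node.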
Model Deployment
1. Download Weights
Download the weights into a local directory; we use r1/, which is the path we'll point vllm serve at later:
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir r1/
2. Configure Head Node
Set up NCCL for ethernet-based communication (replace bond0 with whichever network interface actually connects your nodes):
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=bond0
If you do have InfiniBand and want to use it, leave NCCL_P2P_DISABLE and NCCL_IB_DISABLE unset (or set them to 0) and drop NCCL_SOCKET_IFNAME. Good luck!
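If you're not sure which interface name to pass to NCCL_SOCKET_IFNAME, listing the interfaces and their IPv4 addresses usually makes the right one obvious (again, just a convenience check):
ip -o -4 addr show | awk '{print $2, $4}'
Pick the interface that carries traffic between your nodes; bond0 happens to be the one on our machines.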
Start the Ray head node:
ray start --block --head --port=6379
3. Configure Worker Node
Connect the worker to the head node using the address shown in the head node's output logs:
ray start --block --address=<head-node-ip>:<port>
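Once the worker has joined, it's worth confirming that Ray sees both nodes before launching vLLM. On the head node:
ray status
The resources summary should show both nodes and a total of 16 GPUs.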
4. Launch vLLM on the head node
Launch vLLM with tensor parallelism:
vllm serve r1/ --tensor-parallel-size 16 --max-model-len 48000 --trust-remote-code
With two nodes, we don't have enough VRAM to hold the KV cache for the model's full context length, so we cap max-model-len at 48,000 tokens.
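Loading the weights across 16 GPUs takes a while. While it runs, you can watch GPU memory fill up on either node (the Ray dashboard forwarded to localhost:8265 is the prettier option):
watch -n 2 nvidia-smi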
Testing the Deployment
Test the deployment using curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r1/",
"messages": [{"role": "user", "content": "How many Rs are there in strawberry?"}],
"max_tokens": 2000,
"temperature": 0.7
}' | jq -r '.choices[0].message.content'
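If the request errors out, a quick way to confirm the server is up and has registered the model under the name used in the request body is the OpenAI-compatible models endpoint:
curl http://localhost:8000/v1/models | jq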
Example Response
The model provided this detailed (and somewhat amusing) response:
<think>
Okay, so the user is asking how many Rs are there in the word "strawberry". Hmm, first I need to make sure I spell the word correctly. Let's see: strawberry is spelled S-T-R-A-W-B-E-R-R-Y. Wait, the user wrote "strawberry" instead of strawberry. That's a typo, right? The correct spelling is strawberry with two Rs. But the user might have a typo in the question. Let me check the original question again. They wrote "strawberry". So, that's S-T-R-A-W-B-E-R-R-Y. So how many Rs are there in "strawberry"? Let's break it down letter by letter.
Starting with the letters: S, T, R, A, W, B, E, R, R, Y. So from S to Y, let's count each R. The third letter is R. Then after B and E, there's R, R. So positions 3, 8, and 9. Wait, wait, in "strawberry" as the user wrote, which is missing a letter perhaps? Wait, the correct spelling is S-T-R-A-W-B-E-R-R-Y, which has two Rs at the end. But in the correct spelling, there's an R after B-E, then double R? Let me check. Strawberry is S-T-R-A-W-B-E-R-R-Y. So after the B comes E, then R, R, Y. So that's two Rs at the end. But the user's spelling is "strawberry", which they might have meant strawberry, but even if they typed "strawberry" correctly, how many Rs are there?
[...extensive thinking process continues...]
So I think the answer is three.
</think>
In the word "strawberry," there are **3 R's**. Breaking it down:
- **S**
- **T**
- **R** (1st R)
- **A**
- **W**
- **B**
- **E**
- **R** (2nd R)
- **R** (3rd R)
- **Y**
So the letters are: **S, T, R, A, W, B, E, R, R, Y**. The R's appear at positions 3, 8, and 9.
This response demonstrates how the model methodically breaks down even simple questions, burning an honestly insane number of tokens in the process.
Performance
In initial testing, the model achieves a generation throughput of about 8.4 tokens per second. While this might seem modest, it's worth noting that we're running with small batch sizes, using only 0.4% of our GPU KV cache. In addition, we're running over standard ethernet rather than InfiniBand, so the inter-node synchronization overhead is higher and eats into performance. To be continued...
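For a rough, single-request version of this measurement (no proper benchmarking harness, just curl, jq, and bc), you can time one non-streamed request and divide the completion tokens reported in the response's usage field by the elapsed wall-clock time. The prompt and max_tokens below are arbitrary:
# Time a single request and compute completion tokens per second
START=$(date +%s)
TOKENS=$(curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "r1/", "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}], "max_tokens": 500}' \
  | jq '.usage.completion_tokens')
END=$(date +%s)
echo "scale=1; $TOKENS / ($END - $START)" | bc
Keep in mind this times the whole request end to end, so it also includes prompt processing; vLLM's own log lines report prompt and generation throughput separately.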
Dependencies
For reproducibility, here are the key versions of the dependencies used in this guide:
- vllm==0.6.6.post1
- ray==2.40.0
- torch==2.5.1
- transformers==4.48.0
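To check what your environment actually resolved to, a one-liner that prints the installed versions is enough:
python -c "import vllm, ray, torch, transformers; print(vllm.__version__, ray.__version__, torch.__version__, transformers.__version__)"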