Deploying DeepSeek R1 with Ray and vLLM: A Practical Guide
This guide walks through deploying DeepSeek R1 using Ray and vLLM, focusing on a basic but functional setup using standard ethernet rather than InfiniBand. While I've been using this setup successfully, your mileage may vary depending on your specific infrastructure.
Prerequisites
- At least 2 nodes with H100 GPUs (required for FP8 support)
- ~1TB disk space recommended
- Standard ethernet connections between nodes
- Basic familiarity with conda environments
Node Setup and Dependencies
First, connect to your nodes over SSH. For the node you plan to use as the head, forward ports 8000 and 8265; we'll use them later for curl requests and Ray's cool-looking admin dashboard:
ssh <head-node> -L 8000:localhost:8000 -L 8265:localhost:8265
For the worker node, a standard SSH connection is sufficient:
ssh <worker-node>
On both nodes, install the required dependencies:
Verify the CUDA version reported by the driver (12.2 on our H100 nodes):
nvidia-smi
Install PyTorch with CUDA support:
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
Install additional dependencies:
conda install vllm ray transformers
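Before moving on, it's worth a quick sanity check that PyTorch can actually see the GPUs and was built against the expected CUDA version (this is just a convenience check, not part of the deployment itself):
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda, torch.cuda.device_count())"
This should print True, the CUDA version PyTorch was built with, and the number of GPUs on the node.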
Model Deployment
1. Download Weights
Download the weights into a local directory; we use r1/, which is the path we'll point vllm serve at later:
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir r1/
2. Configure Head Node
Set up NCCL for ethernet-based communication (replace bond0 with whichever network interface actually connects your nodes):
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=bond0
If you do have InfiniBand and want to use it, leave NCCL_P2P_DISABLE and NCCL_IB_DISABLE unset (or set them to 0) and drop NCCL_SOCKET_IFNAME. Good luck!
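If you're not sure which interface name to pass to NCCL_SOCKET_IFNAME, listing the interfaces and their IPv4 addresses usually makes the right one obvious (again, just a convenience check):
ip -o -4 addr show | awk '{print $2, $4}'
Pick the interface that carries traffic between your nodes; bond0 happens to be the one on our machines.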
Start the Ray head node:
ray start --block --head --port=6379
3. Configure Worker Node
Connect the worker to the head node using the address shown in the head node's output logs:
ray start --block --address=<head-node-ip>:<port>
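Once the worker has joined, it's worth confirming that Ray sees both nodes before launching vLLM. On the head node:
ray status
The resources summary should show both nodes and a total of 16 GPUs.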
4. Launch vLLM on the head node
Launch vLLM with tensor parallelism:
vllm serve r1/ --tensor-parallel-size 16 --max-model-len 48000 --trust-remote-code
With two nodes, we don't have enough VRAM to hold the KV cache for the model's full context length, so we cap max-model-len at 48,000 tokens.
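Loading the weights across 16 GPUs takes a while. While it runs, you can watch GPU memory fill up on either node (the Ray dashboard forwarded to localhost:8265 is the prettier option):
watch -n 2 nvidia-smi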
Testing the Deployment
Test the deployment using curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "r1/",
"messages": [{"role": "user", "content": "How many Rs are there in strawberry?"}],
"max_tokens": 2000,
"temperature": 0.7
}' | jq -r '.choices[0].message.content'
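If the request errors out, a quick way to confirm the server is up and has registered the model under the name used in the request body is the OpenAI-compatible models endpoint:
curl http://localhost:8000/v1/models | jq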
Example Response
The model provided this detailed (and somewhat amusing) response:
<think>
Okay, so the user is asking how many Rs are there in the word "strawberry". Hmm, first I need to make sure I spell the word correctly. Let's see: strawberry is spelled S-T-R-A-W-B-E-R-R-Y. Wait, the user wrote "strawberry" instead of strawberry. That's a typo, right? The correct spelling is strawberry with two Rs. But the user might have a typo in the question. Let me check the original question again. They wrote "strawberry". So, that's S-T-R-A-W-B-E-R-R-Y. So how many Rs are there in "strawberry"? Let's break it down letter by letter.
Starting with the letters: S, T, R, A, W, B, E, R, R, Y. So from S to Y, let's count each R. The third letter is R. Then after B and E, there's R, R. So positions 3, 8, and 9. Wait, wait, in "strawberry" as the user wrote, which is missing a letter perhaps? Wait, the correct spelling is S-T-R-A-W-B-E-R-R-Y, which has two Rs at the end. But in the correct spelling, there's an R after B-E, then double R? Let me check. Strawberry is S-T-R-A-W-B-E-R-R-Y. So after the B comes E, then R, R, Y. So that's two Rs at the end. But the user's spelling is "strawberry", which they might have meant strawberry, but even if they typed "strawberry" correctly, how many Rs are there?
[...extensive thinking process continues...]
So I think the answer is three.
</think>
In the word "strawberry," there are **3 R's**. Breaking it down:
- **S**
- **T**
- **R** (1st R)
- **A**
- **W**
- **B**
- **E**
- **R** (2nd R)
- **R** (3rd R)
- **Y**
So the letters are: **S, T, R, A, W, B, E, R, R, Y**. The R's appear at positions 3, 8, and 9.
This response demonstrates how the model methodically breaks down even simple questions, burning an honestly insane number of tokens in the process.
Performance
In initial testing, the model achieves a generation throughput of about 8.4 tokens per second. While this might seem modest, it's worth noting that we're running with small batch sizes, using only 0.4% of our GPU KV cache. In addition, we're running over standard ethernet rather than InfiniBand, so the inter-node synchronization overhead is higher and eats into performance. To be continued...
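For a rough, single-request version of this measurement (no proper benchmarking harness, just curl, jq, and bc), you can time one non-streamed request and divide the completion tokens reported in the response's usage field by the elapsed wall-clock time. The prompt and max_tokens below are arbitrary:
# Time a single request and compute completion tokens per second
START=$(date +%s)
TOKENS=$(curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "r1/", "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}], "max_tokens": 500}' \
  | jq '.usage.completion_tokens')
END=$(date +%s)
echo "scale=1; $TOKENS / ($END - $START)" | bc
Keep in mind this times the whole request end to end, so it also includes prompt processing; vLLM's own log lines report prompt and generation throughput separately.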
Dependencies
For reproducibility, here are the key versions of the dependencies used in this guide:
- vllm==0.6.6.post1
- ray==2.40.0
- torch==2.5.1
- transformers==4.48.0
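To check what your environment actually resolved to, a one-liner that prints the installed versions is enough:
python -c "import vllm, ray, torch, transformers; print(vllm.__version__, ray.__version__, torch.__version__, transformers.__version__)"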