
Run your own AI at scale: Tuning vLLM for LLM serving (vol. 1)

17. 03. 2025
Overview

Learn how to optimize inference for large language models using vLLM, including best practices for GPU parallelism and token batching.

Introduction to vLLM

This is a “short” series describing our findings from optimizing the serving of open-source autoregressive LLMs with the vLLM inference and serving library. vLLM is one of the popular options for serving your own LLMs at scale.

The purpose of this series is not to explain all the great and fascinating details of the vLLM library, although some explanation is unavoidable if the engine configuration choices are to be properly argued.

We will just leave here a beautiful schema of vLLM taken from this paper, explaining the marvelously elegant PagedAttention algorithm that sits at the core of vLLM's soul.

Figure 1: vLLM overview. (Taken from the paper Efficient Memory Management for Large Language Model Serving with PagedAttention.)

Hardware used

For the purpose of our benchmarking, we used a single server node with 4× NVIDIA A100 SXM4 80 GB GPUs.

With this configuration we have 320 GB of VRAM in total on our node. The server uses NVLink and NVSwitch technology, enabling a GPU interconnect bandwidth of 600 GB/s.

Dataset used

We are using a dataset created by merging four existing coding datasets and filtering for entries related to Java, Spring, JavaScript and React.

We make this merged dataset (crozai/vllm-benchmark-coding) publicly available on our Hugging Face profile.

Basically, we wanted a relevant benchmarking test set for the use case of coding with an LLM.

Related topic: If you want to learn about using LLMs for data anonymization, check out this blog.

Model used and model architecture

For the purpose of benchmarking, we used the deepseek-ai/DeepSeek-R1-Distill-Llama-70B model, which is distilled from DeepSeek-R1 but based on the meta-llama/Llama-3.3-70B-Instruct model.

Basically, it is Llama-3.3-70B enhanced with reasoning capabilities distilled from DeepSeek-R1.

deepseek-ai/DeepSeek-R1-Distill-Llama-70B characteristics:

  • Number of parameters – 70.6B
  • Parameter precision (Precision) – BF16 (bfloat16), 2 bytes per parameter
  • Number of layers (L) – 80
  • Hidden dimension (H) – 8192
  • Grouped Query Attention mechanism
    • KV heads (n_kv_heads) – 8
    • Query heads (n_q_heads) – 64
  • Max context length (max_position_embeddings) – 131072 tokens
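
As a quick sanity check, these numbers can be read straight from the model configuration published on Hugging Face. A minimal sketch using the transformers library (assuming it is installed and the model repository is reachable):

from transformers import AutoConfig

# Fetch only the model configuration (no weights are downloaded)
cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-70B")

print("layers (L):          ", cfg.num_hidden_layers)        # 80
print("hidden dim (H):      ", cfg.hidden_size)              # 8192
print("query heads:         ", cfg.num_attention_heads)      # 64
print("KV heads:            ", cfg.num_key_value_heads)      # 8
print("max context length:  ", cfg.max_position_embeddings)  # 131072
print("dtype:               ", cfg.torch_dtype)              # bfloat16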

Benchmark used

To perform the benchmark tests, we used the benchmark_serving.py script available in the vLLM GitHub source code repo:

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks

Each test is performed by invoking the benchmark script with the same parameters every time:

python vllm/benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --dataset-name hf \
  --dataset-path crozai/vllm-benchmark-coding \
  --hf-split train \
  --max-concurrency 10 \
  --request-rate 10 \
  --num-prompts 50 \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --metric-percentiles 25,50,75,99 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --seed 42

In our tests we used vLLM v0.7.3.

Memory Hunger Games – Pipeline Parallelism vs Tensor Parallelism

Pipeline and tensor parallelism in vLLM are distributed computing strategies used to scale large language models that don't fit on a single GPU (or even a single node). Since our test model uses 2-byte (BF16) precision, the memory required to store the model weights alone is around 140 GB. However, the memory required for the KV cache of a single maximum-size prompt of 128K tokens is around

2 (K and V cache) × Precision × H × L × Context = 2 × 2 × 8192 × 80 × 131072 bytes ≈ 343.6 GB

Since our requirement is to support inference with prompt lengths of up to 128K tokens, we must use parallelism.
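
For readers who want to reproduce the arithmetic, here is a small back-of-the-envelope sketch following the formula above. Note that it uses the full hidden dimension H, so for a GQA model it is an upper bound rather than the exact amount vLLM would allocate:

# Back-of-the-envelope memory estimate following the formula from the text
PRECISION_BYTES = 2        # BF16
NUM_PARAMS      = 70.6e9   # number of model parameters
H               = 8192     # hidden dimension
L               = 80       # number of layers
CONTEXT         = 131072   # max context length (128K tokens)

weights_gb = NUM_PARAMS * PRECISION_BYTES / 1e9
# Factor 2 accounts for the K and V caches; the full hidden dimension H is used,
# which is an upper bound for GQA models (only n_kv_heads * head_dim is cached per layer).
kv_cache_gb = 2 * PRECISION_BYTES * H * L * CONTEXT / 1e9

print(f"weights:  ~{weights_gb:.1f} GB")   # ~141.2 GB
print(f"KV cache: ~{kv_cache_gb:.1f} GB")  # ~343.6 GB for a single 128K-token prompt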

Tensor parallelism

When tensor parallelism is used, the model weights are distributed (sharded) across multiple GPUs. With Grouped Query Attention models like ours, the sharding of attention parameters is performed per query head. This is kind of logical when you think about it, since the computation for each group of heads is performed independently during the forward pass of the attention mechanism. It is therefore important that the number of query heads in the model is divisible by the number of GPUs we are sharding across. Tensor parallelism in vLLM is configured with the --tensor-parallel-size parameter, and it is required that:

n_q_heads / tensor-parallel-size = a whole number.

Figure 2: Tensor parallelism. (Image from https://blog.squeezebits.com/vllm-vs-tensorrtllm-9-parallelism-strategies-36310)
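
To make the divisibility requirement more concrete, here is a rough illustration of how the attention heads of our model end up per GPU for a few --tensor-parallel-size values (attention heads only; how vLLM splits the MLP and embedding weights is not covered here):

# Rough illustration of attention-head sharding under tensor parallelism
n_q_heads, n_kv_heads, hidden = 64, 8, 8192
head_dim = hidden // n_q_heads  # 128

for tp in (1, 2, 4, 8):
    # the number of query heads must be divisible by the tensor parallel size
    assert n_q_heads % tp == 0
    print(f"tp={tp}: {n_q_heads // tp} query heads and {n_kv_heads // tp} KV heads per GPU")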

Pipeline parallelism

This parallelism option distributes model layers across multiple devices, assigning contiguous groups of layers to different GPUs. It is also used to enable inference of models too large for a single GPU. It is mostly used to distribute a model across multiple nodes (servers), although it can be applied within a single node with multiple GPUs too. The parameter controlling this parallelization is --pipeline-parallel-size.

Figure 3: Pipeline parallelism. (Image from https://blog.squeezebits.com/vllm-vs-tensorrtllm-9-parallelism-strategies-36310)

The relationship between these two parameters that needs to be respected is given by the formula:

tensor-parallel-size x pipeline-parallel-size = total number of GPUs across nodes
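
A small helper that captures both constraints can be handy when choosing a configuration. A sketch (the function is ours, values correspond to our setup):

def validate_parallelism(tp: int, pp: int, total_gpus: int, n_q_heads: int) -> None:
    """Check the two constraints discussed above for a (tp, pp) combination."""
    if tp * pp != total_gpus:
        raise ValueError(f"tensor-parallel-size x pipeline-parallel-size ({tp} x {pp}) "
                         f"must equal the total number of GPUs ({total_gpus})")
    if n_q_heads % tp != 0:
        raise ValueError(f"number of query heads ({n_q_heads}) must be divisible "
                         f"by tensor-parallel-size ({tp})")

validate_parallelism(tp=4, pp=1, total_gpus=4, n_q_heads=64)  # OK
validate_parallelism(tp=2, pp=2, total_gpus=4, n_q_heads=64)  # OK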

Since we are running the model on a single node with multiple GPUs, we will test tensor parallelism against pipeline parallelism on that node.

Requirements:

Utilize all 4 GPUs in the most optimal way to achieve the best possible throughput, inter-token latency and time to first token.

Tests:

We performed two test runs: one with tensor-parallel-size=4, and one with tensor-parallel-size=2 and pipeline-parallel-size=2.

command:
  --model=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  --trust-remote-code
  --device=cuda
  --disable-log-requests
  --gpu-memory-utilization=0.95
  --tensor-parallel-size=4
========= Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  87.30
Total input tokens:                      7005
Total generated tokens:                  23847
Request throughput (req/s):              0.57
Output token throughput (tok/s):         273.18
Total Token throughput (tok/s):          353.42
---------------Time to First Token----------------
Mean TTFT (ms):                          98.05
Median TTFT (ms):                        85.40
P25 TTFT (ms):                           83.00
P50 TTFT (ms):                           85.40
P75 TTFT (ms):                           99.67
P99 TTFT (ms):                           169.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.95
Median TPOT (ms):                        33.12
P25 TPOT (ms):                           32.88
P50 TPOT (ms):                           33.12
P75 TPOT (ms):                           33.26
P99 TPOT (ms):                           33.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.97
Median ITL (ms):                         32.58
P25 ITL (ms):                            32.27
P50 ITL (ms):                            32.58
P75 ITL (ms):                            32.86
P99 ITL (ms):                            78.80

command:
  --model=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  --trust-remote-code
  --device=cuda
  --disable-log-requests
  --gpu-memory-utilization=0.95
  --tensor-parallel-size=2
  --pipeline-parallel-size=2
============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  162.19
Total input tokens:                      7005
Total generated tokens:                  23847
Request throughput (req/s):              0.31
Output token throughput (tok/s):         147.03
Total Token throughput (tok/s):          190.22
---------------Time to First Token----------------
Mean TTFT (ms):                          160.11
Median TTFT (ms):                        142.13
P25 TTFT (ms):                           132.10
P50 TTFT (ms):                           142.13
P75 TTFT (ms):                           168.24
P99 TTFT (ms):                           289.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          61.38
Median TPOT (ms):                        61.71
P25 TPOT (ms):                           61.18
P50 TPOT (ms):                           61.71
P75 TPOT (ms):                           61.84
P99 TPOT (ms):                           62.45
---------------Inter-token Latency----------------
Mean ITL (ms):                           61.39
Median ITL (ms):                         60.78
P25 ITL (ms):                            59.80
P50 ITL (ms):                            60.78
P75 ITL (ms):                            61.79
P99 ITL (ms):                            80.23

Conclusion

Pipeline parallelism on a single node is not a usable option for us: compared to pure tensor parallelism, throughput drops by almost half and latencies roughly double. We will use tensor parallelism set to 4, since we have 4 GPUs and the model has 64 query attention heads (64 / 4 = 16 query heads per GPU).

Multiple requests and large prompts vs small prompts

Two key features in vLLM’s inference scheduling are token batching and prompt chunking, controlled by the --max-num-batched-tokens and --enable-chunked-prefill settings.

Token batching refers to how vLLM groups multiple sequences (requests/prompts) into a single forward pass of the model. Instead of processing one sequence at a time, vLLM can batch multiple sequences together to better utilize the GPU. --max-num-batched-tokens sets an upper limit on the total number of tokens that can be processed in one iteration (one scheduler step).

Prompt chunking refers to splitting a long prompt (input context) into smaller chunks that can be processed in pieces rather than all at once. In vLLM, this feature is called chunked prefill and is enabled by the flag --enable-chunked-prefill. When a new request comes in with a prompt of N tokens, the model must perform a prefill (prompt processing) step: it processes all N input tokens to build up the key/value cache (KV cache) before it can start generating output tokens. This prefill step can be expensive and take a lot of time if N is large (scenarios like a long chat history or a large application codebase in the context).

After the prefill, the model generates output tokens one by one (this is also called the decode stage). Each decode step uses the cached context (KV cache) and appends one new token. By default, without chunked prefill, vLLM prioritizes prefill requests and does not mix prefill and decode in the same batch. When chunked prefill is enabled, vLLM changes its scheduling to allow mixing prompt processing with token generation by breaking the prompt up into chunks. This way it prioritizes decode requests (ongoing generations) first.
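
To illustrate the scheduling idea (this is a toy model, not vLLM's actual scheduler code), here is a sketch of how a long prompt gets consumed in chunks of at most max_num_batched_tokens per scheduler step while ongoing decodes keep their priority:

# Toy illustration of chunked prefill scheduling (not vLLM's real scheduler)
def schedule_step(prefill_remaining: int, num_decodes: int, max_num_batched_tokens: int):
    """Return (decode_tokens, prefill_tokens) scheduled for one step.

    Decodes are prioritized: each ongoing generation gets one token slot,
    and whatever budget remains is used to advance the pending prefill.
    """
    decode_tokens = min(num_decodes, max_num_batched_tokens)
    budget_left = max_num_batched_tokens - decode_tokens
    prefill_tokens = min(prefill_remaining, budget_left)
    return decode_tokens, prefill_tokens

# Example: a 100K-token prompt arrives while 32 generations are already running
# and --max-num-batched-tokens is set to 16384
remaining, step = 100_000, 0
while remaining > 0:
    step += 1
    decodes, prefill = schedule_step(remaining, num_decodes=32, max_num_batched_tokens=16_384)
    remaining -= prefill
    print(f"step {step}: {decodes} decode tokens + {prefill} prefill tokens")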

Requirements:

  • Achieve optimal throughput for both small and large prompts
  • Support maximum-size prompts of 128K tokens
  • Prioritize ongoing requests
  • Support processing of multiple concurrent requests

Tests:

We performed 4 test runs with --max-model-len=131072, setting --max-num-batched-tokens to 4 different values: 16K, 32K, 64K and 128K.

During testing we found that vLLM enables chunked prefill by default for long-context models:

WARNING 03-09 15:34:35 arg_utils.py:1187] Chunked prefill is enabled by default for models with max_model_len > 32K
command:
  --model=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  --trust-remote-code
  --device=cuda
  --disable-log-requests
  --gpu-memory-utilization=0.95
  --tensor-parallel-size=4
  --max-model-len=131072
  --max-num-batched-tokens=131072
  --enable-chunked-prefill
============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  87.50
Total input tokens:                      7005
Total generated tokens:                  23847
Request throughput (req/s):              0.57
Output token throughput (tok/s):         272.54
Total Token throughput (tok/s):          352.60
---------------Time to First Token----------------
Mean TTFT (ms):                          101.78
Median TTFT (ms):                        90.95
P25 TTFT (ms):                           83.05
P50 TTFT (ms):                           90.95
P75 TTFT (ms):                           110.76
P99 TTFT (ms):                           173.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.02
Median TPOT (ms):                        33.22
P25 TPOT (ms):                           32.86
P50 TPOT (ms):                           33.22
P75 TPOT (ms):                           33.32
P99 TPOT (ms):                           34.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.04
Median ITL (ms):                         32.62
P25 ITL (ms):                            32.28
P50 ITL (ms):                            32.62
P75 ITL (ms):                            32.93
P99 ITL (ms):                            79.15
command:
  --model=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  --trust-remote-code
  --device=cuda
  --disable-log-requests
  --gpu-memory-utilization=0.95
  --tensor-parallel-size=4
  --max-model-len=131072
  --max-num-batched-tokens=65536
  --enable-chunked-prefill
========= Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  89.88
Total input tokens:                      7005
Total generated tokens:                  23847
Request throughput (req/s):              0.56
Output token throughput (tok/s):         265.32
Total Token throughput (tok/s):          343.26
---------------Time to First Token----------------
Mean TTFT (ms):                          118.05
Median TTFT (ms):                        108.83
P25 TTFT (ms):                           105.99
P50 TTFT (ms):                           108.83
P75 TTFT (ms):                           113.94
P99 TTFT (ms):                           185.70
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.89
Median TPOT (ms):                        34.13
P25 TPOT (ms):                           33.75
P50 TPOT (ms):                           34.13
P75 TPOT (ms):                           34.25
P99 TPOT (ms):                           34.49
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.94
Median ITL (ms):                         33.33
P25 ITL (ms):                            33.05
P50 ITL (ms):                            33.33
P75 ITL (ms):                            33.60
P99 ITL (ms):                            101.69
command:
  --model=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  --trust-remote-code
  --device=cuda
  --disable-log-requests
  --gpu-memory-utilization=0.95
  --tensor-parallel-size=4
  --max-model-len=131072
  --max-num-batched-tokens=32768
  --enable-chunked-prefill
============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  88.01
Total input tokens:                      7005
Total generated tokens:                  23847
Request throughput (req/s):              0.57
Output token throughput (tok/s):         270.95
Total Token throughput (tok/s):          350.53
---------------Time to First Token----------------
Mean TTFT (ms):                          109.00
Median TTFT (ms):                        103.05
P25 TTFT (ms):                           89.11
P50 TTFT (ms):                           103.05
P75 TTFT (ms):                           111.41
P99 TTFT (ms):                           177.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.20
Median TPOT (ms):                        33.31
P25 TPOT (ms):                           33.14
P50 TPOT (ms):                           33.31
P75 TPOT (ms):                           33.63
P99 TPOT (ms):                           33.94
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.23
Median ITL (ms):                         32.75
P25 ITL (ms):                            32.42
P50 ITL (ms):                            32.75
P75 ITL (ms):                            33.03
P99 ITL (ms):                            80.08
command:
  --model=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  --trust-remote-code
  --device=cuda
  --disable-log-requests
  --gpu-memory-utilization=0.95
  --tensor-parallel-size=4
  --max-model-len=131072
  --max-num-batched-tokens=16384
  --enable-chunked-prefill
============ Serving Benchmark Result ============
Successful requests:                     50
Benchmark duration (s):                  88.07
Total input tokens:                      7005
Total generated tokens:                  23847
Request throughput (req/s):              0.57
Output token throughput (tok/s):         270.79
Total Token throughput (tok/s):          350.33
---------------Time to First Token----------------
Mean TTFT (ms):                          109.70
Median TTFT (ms):                        100.02
P25 TTFT (ms):                           84.75
P50 TTFT (ms):                           100.02
P75 TTFT (ms):                           113.25
P99 TTFT (ms):                           177.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.26
Median TPOT (ms):                        33.38
P25 TPOT (ms):                           33.11
P50 TPOT (ms):                           33.38
P75 TPOT (ms):                           33.57
P99 TPOT (ms):                           34.53
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.26
Median ITL (ms):                         32.74
P25 ITL (ms):                            32.42
P50 ITL (ms):                            32.74
P75 ITL (ms):                            33.19
P99 ITL (ms):                            

Conclusion:

All results are almost identical. Since vLLM uses min(max_model_len, max_num_batched_tokens) to set the prompt limit, and our hardware has enough resources to cope with maximum batch sizes, we will configure both parameters to the maximum context size: --max-model-len=131072 --max-num-batched-tokens=131072. A smaller max-num-batched-tokens might be reasonable in more memory-restrictive environments.

There are other parameters that also influence prompt chunking, but we will not go into their details for now and will use the defaults. These are --max-num-partial-prefills, --max-long-partial-prefills and --long-prefill-token-threshold.
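
For completeness, the configuration we converged on in this part can also be expressed through vLLM's offline Python API. A sketch equivalent to the server flags used above (the constructor arguments mirror the CLI flags; the prompt is just an example of ours):

from vllm import LLM, SamplingParams

# Offline-engine equivalent of the server flags we settled on in this part
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    tensor_parallel_size=4,
    max_model_len=131072,
    max_num_batched_tokens=131072,
    enable_chunked_prefill=True,
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Write a Spring Boot REST controller for a simple TODO service."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(outputs[0].outputs[0].text)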

TO BE CONTINUED….
