vLLM: Serving LLMs in Production with Very High Throughput


vLLM has settled in as the reference inference server for language models on GPU when what you care about is aggregate throughput. Its PagedAttention algorithm manages the KV cache as if it were paged memory in an operating system, and that single architectural decision explains most of the gap against naive implementations. This article collects what I have learned deploying it in production, and above all when it makes sense and when it does not.

Why PagedAttention changes the rules

Traditional inference servers reserve contiguous memory per request to hold the attention cache, which causes massive fragmentation when requests of very different lengths share the same GPU. PagedAttention breaks that cache into small fixed-size blocks and manages them with a page table, exactly like a kernel does with RAM. You can pack far more simultaneous requests onto the same card without touching the weights.
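As a mental model, the bookkeeping can be sketched in a few lines of Python. This is a toy illustration of the idea, not vLLM's actual implementation: each sequence keeps a block table mapping logical token blocks to physical cache blocks handed out by a shared allocator, so a short and a long request coexist without fragmenting a contiguous region.

```python
BLOCK_SIZE = 16  # tokens per KV block; vLLM uses a similarly small fixed size


class BlockAllocator:
    """Hands out physical KV-cache blocks from a shared free list."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()


class Sequence:
    """One request; its block table maps logical blocks to physical ones."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Only grab a new physical block when the current one fills up
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


alloc = BlockAllocator(num_blocks=64)
short, long_seq = Sequence(alloc), Sequence(alloc)
for _ in range(10):
    short.append_token()     # 10 tokens -> occupies a single block
for _ in range(50):
    long_seq.append_token()  # 50 tokens -> four blocks, none contiguous by design
```

The point of the sketch is the waste model: a request only ever strands the tail of its last block (at most fifteen tokens here), instead of an entire contiguous reservation sized for the worst case.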

Continuous batching stacks on top: requests enter and leave the ongoing batch without waiting for the previous group to finish, which eliminates the idle GPU gaps that appear when some answers are short and others are long. In practice, the throughput improvement over a naive server ranges from three to twenty-four times depending on traffic mix, and that number is not marketing: it is memory occupancy math.

vLLM also exposes an OpenAI-compatible API, supports tensor parallelism across GPUs, offers quantization in AWQ, GPTQ, FP8 and INT8, and covers the relevant open families: Llama, Mistral, Qwen, DeepSeek, Phi and Gemma.

Installation and first boot

The minimum entry point is a single command that brings up an HTTP server with the OpenAI API ready to consume from any existing client:

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --quantization awq

The endpoint listens at http://localhost:8000/v1, and any SDK that talks to the OpenAI API works by changing only the base URL and passing a dummy key. That, in my experience, is what accelerates adoption the most: application code does not change.
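For illustration, a raw request with nothing but the standard library looks like this. The endpoint path and bearer token follow the OpenAI wire format; the key is a dummy because vLLM does not check it unless you configure one:

```python
import json
import urllib.request


def chat_request(base_url, model, messages, api_key="dummy"):
    """Build an OpenAI-style chat completion request for a local vLLM server."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # any value works by default
        },
    )


req = chat_request(
    "http://localhost:8000/v1",
    "meta-llama/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Hello"}],
)
# urllib.request.urlopen(req) sends it once the server is up
```

Swapping in the official OpenAI SDK is the same exercise: point its base_url at the server and keep everything else untouched.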

Four parameters move the needle. The tensor parallelism size tells vLLM how many GPUs to shard the model across, matching the number of physical cards per node. GPU memory utilization usually sits between ninety and ninety-five percent; pushing higher leaves no headroom for spikes and produces hard-to-reproduce OOM errors. Maximum context length conditions how much KV cache you reserve, so expanding it to thirty-two thousand tokens when real traffic uses two thousand is giving away memory. And quantization, when the model supports it, is almost always a net win.
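The context-length point is easy to quantify. For a GQA model in the shape of Llama 3 8B (thirty-two layers, eight KV heads, head dimension 128; figures taken from the published config, so double-check them for your checkpoint), the per-token KV cost in FP16 works out as follows:

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # one K and one V vector per layer, per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes


per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # 128 KiB
gib = 2**30
per_seq_at_32k = per_token * 32768 / gib  # 4.0 GiB reserved per worst-case sequence
per_seq_at_2k = per_token * 2048 / gib    # 0.25 GiB when the limit matches traffic
```

Four gibibytes per maximum-length sequence versus a quarter of one is exactly the memory you are giving away when the configured context is sixteen times longer than real requests.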

Quantization, parallelism, and real performance

AWQ offers the best quality-memory trade-off for most Llama and Mistral models, and the weights come pre-quantized on Hugging Face, so startup is immediate. GPTQ is equivalent in spirit but uses a different format. FP8 is interesting only on H100, because on A100 it falls back to slow paths. INT4 compresses a lot but starts to degrade reasoning on long chains, and that does not always show up in short benchmarks.

Tensor parallelism becomes mandatory once the model stops fitting on a single GPU: a Llama 3.1 seventy-billion model in FP16 asks for four eighty-gigabyte A100s, while with AWQ it fits comfortably on two. Pipeline parallelism only pays off once you have exhausted tensor parallelism inside a node and need to cross into another server; inter-node latency penalizes first-token time heavily.
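The arithmetic behind those GPU counts is just weight bytes. A rough back-of-the-envelope (weights only; KV cache and activations need their own headroom on top):

```python
def weight_gib(params_billions, bits_per_weight):
    """Approximate weight footprint in GiB for a dense model."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30


fp16_gib = weight_gib(70, 16)  # ~130 GiB: far too big for one or two 80 GiB cards
awq_gib = weight_gib(70, 4)    # ~33 GiB of weights after 4-bit quantization

# per-card share at the tensor-parallel sizes discussed in the text
per_gpu_fp16_tp4 = fp16_gib / 4  # ~33 GiB/card, leaving KV room on 80 GiB A100s
per_gpu_awq_tp2 = awq_gib / 2    # ~16 GiB/card
```
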

On concrete performance, a Llama 3 eight-billion model on a single eighty-gigabyte A100 gives me around sixty to eighty tokens per second on isolated requests, but once fifty concurrent connections pile on, the aggregate jumps to two or three thousand tokens per second. That jump is the gift from PagedAttention and continuous batching, and it is what makes the same hardware orders of magnitude cheaper per token under real load than serving one at a time.

Observability and advanced features

vLLM exposes Prometheus metrics via a dedicated port. The five I always watch are requests running, requests waiting, GPU KV cache usage percent, time to first token, and end-to-end latency. The official Grafana dashboard covers those signals and the key is to watch the queue: if it grows steadily you do not have a model problem but a capacity problem, and the fix is either horizontal replicas or a more powerful card.
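Scraping those gauges is one HTTP call against the /metrics endpoint. The metric names below are the ones I see in recent releases (prefixed vllm:), but they have shifted between versions, so treat them as an assumption and check your own /metrics output:

```python
import re

# Gauge names observed in recent vLLM releases; verify against your /metrics dump
WATCHED = [
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
]


def parse_metrics(text):
    """Pull the watched gauges out of a Prometheus exposition-format dump."""
    values = {}
    for name in WATCHED:
        pattern = "^" + re.escape(name) + r"(?:\{[^}]*\})?\s+([0-9.eE+-]+)\s*$"
        m = re.search(pattern, text, re.MULTILINE)
        if m:
            values[name] = float(m.group(1))
    return values


sample = (
    'vllm:num_requests_running{model_name="llama"} 12\n'
    'vllm:num_requests_waiting{model_name="llama"} 3\n'
    'vllm:gpu_cache_usage_perc{model_name="llama"} 0.82\n'
)
stats = parse_metrics(sample)
# live: text = urllib.request.urlopen("http://host:8000/metrics").read().decode()
```

A steadily growing num_requests_waiting is the capacity signal described above; alert on its trend, not its instantaneous value.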

Three advanced features deserve mention. Multi-LoRA serves several adapters on the same base model and switches per request, which is gold when you have done several small fine-tunes. Speculative decoding uses a small draft model that proposes tokens and the main model verifies them, with realistic two to three times speedups. And structured output, via Outlines integration, guarantees JSON valid against a schema, eliminating an entire class of fragile parsers.
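As a concrete taste of the structured-output path, this is the shape of request I send. The guided_json field is a vLLM vendor extension on top of the standard OpenAI body; the field name matches what I have used on the 0.6.x line, but verify it against your server version's docs:

```python
import json

person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# guided_json is a vLLM-specific extension on the chat completion request;
# the server constrains decoding so the reply is valid against the schema
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Extract the person as JSON."}],
    "guided_json": person_schema,
}
body = json.dumps(payload).encode()  # POST to /v1/chat/completions as usual
```

The reply's content field then parses with a plain json.loads, which is exactly the class of fragile regex parser this feature eliminates.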

vLLM versus TGI and SGLang

Hugging Face’s Text Generation Inference keeps solid engineering and impeccable HF integration, but its licence change complicates some commercial deployments and on pure throughput it sits slightly behind. SGLang is strong at shared-prefix workloads but its community is still small. LMDeploy shines with the Intern family and aggressive quantization, but loses steam outside that niche. vLLM occupies the centre of gravity: it wins on general throughput, keeps an Apache 2.0 licence, and receives improvements nearly every week.

When it is worth it

If your load reaches several million tokens per day and you control the hardware, self-hosting with vLLM beats paying a per-token API sooner than people tend to think. At October 2024 prices the break-even point sits around ten million tokens per day for a mid-sized model; below that figure the commercial API is still cheaper once you count engineering hours and power. Above that line the savings turn aggressive and sovereignty over the model starts to carry strategic weight.
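The break-even reasoning is simple enough to write down. All inputs below are illustrative placeholders (GPU rental price, engineering hours, API price), not the article's figures; plug in your own quotes:

```python
def break_even_tokens_per_day(api_usd_per_mtok, gpu_usd_per_hour, num_gpus,
                              eng_hours_per_month, eng_usd_per_hour):
    """Daily token volume at which self-hosting cost matches the API bill."""
    self_host_per_month = (gpu_usd_per_hour * num_gpus * 24 * 30
                           + eng_hours_per_month * eng_usd_per_hour)
    api_usd_per_token = api_usd_per_mtok / 1e6
    return self_host_per_month / 30 / api_usd_per_token


# Placeholder inputs, not real quotes: $10/Mtok API, two $4/h GPUs,
# 20 engineering hours a month at $150/h
volume = break_even_tokens_per_day(10.0, 4.0, 2, 20, 150.0)
```

The sensitivity matters more than the point estimate: the result moves linearly with the API price and the GPU rate, which is why cheap per-token tiers push the break-even volume up so fast.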

The main mistake I see is treating vLLM as a fire-and-forget binary. It is not: it requires judgement to size maximum context, pick quantization, decide horizontal replicas, and calibrate memory utilization. It also asks you to accept honest limits: it only supports NVIDIA well (ROCm is still experimental in this 0.6 release), startup loads several gigabytes of weights and can take minutes, and some less popular models have less optimized kernels.

If you are taking your own LLMs into serious production in 2024, I would start with vLLM directly and only look at alternatives when a concrete problem pushes you elsewhere. Trying three servers before committing burns weeks that rarely repay the extra information. Wire metrics from day one, watch the queue and KV cache occupancy, and scale with replicas when the queue grows. That covers ninety percent of real-world cases.
