vLLM: Serving LLMs in Production with Very High Throughput
Updated: 2026-05-03
vLLM[1] has settled in as the reference inference server for language models on GPU when what you care about is aggregate throughput. Its PagedAttention algorithm manages the KV cache as if it were paged memory in an operating system, and that single architectural decision explains most of the gap against naive implementations. This article covers the essentials for production deployment and, above all, when it makes sense and when it does not.
Key takeaways
- PagedAttention eliminates KV memory fragmentation and multiplies the number of simultaneous requests per card.
- Continuous batching eliminates wait time between batches and is the main reason for the throughput jump under real load.
- The OpenAI-compatible API lets you migrate existing applications by changing only the base URL.
- AWQ offers the best quality-memory trade-off for Llama and Mistral; FP8 only pays off on H100.
- The break-even vs commercial API sits around ten million tokens per day for a mid-sized model.
Why PagedAttention changes the rules
Traditional inference servers reserve contiguous memory per request to hold the attention cache. When requests of very different lengths share the same GPU, this strategy causes massive fragmentation: unused memory blocks that cannot be lent to other requests. PagedAttention breaks the cache into small fixed-size blocks and manages them with a page table, exactly like a kernel does with RAM.
The result is that you can pack far more simultaneous requests onto the same card without touching the weights. Continuous batching stacks on top: requests enter and leave the ongoing batch without waiting for the previous group to finish, eliminating idle GPU gaps that appear when some answers are short and others are long. In practice, throughput improvement over a naive server ranges from 3x to 24x depending on traffic mix — and that number is not marketing, it is memory occupancy math.
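To put numbers on that occupancy math, here is a back-of-the-envelope sketch for Llama 3 8B in FP16 on a single 80 GB A100. The model shape is public; the overhead and utilization figures are rough assumptions for illustration, not measurements.

# Back-of-the-envelope KV cache sizing for Llama 3 8B in FP16 on one 80 GB A100.
# Model shape is public; overhead and utilization figures are rough assumptions.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

# Keys + values, across every layer, per token of context.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # 131,072 B

weights_gb = 16                                # ~8B params at 2 bytes each
free_kv_gb = 80 * 0.95 - weights_gb - 4        # assume ~4 GB activations/overhead

tokens_in_cache = free_kv_gb * 1e9 / bytes_per_token
block_size = 16                                # tokens per KV block, vLLM's default
print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{tokens_in_cache / 1e3:.0f}k cacheable tokens "
      f"(~{tokens_in_cache / block_size:,.0f} blocks of {block_size} tokens)")
print(f"~{tokens_in_cache / 2048:.0f} concurrent 2,048-token sequences, "
      f"because each request only pins the blocks it actually fills")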
vLLM also exposes an OpenAI-compatible API, supports tensor parallelism across GPUs, offers quantization in AWQ, GPTQ, FP8, and INT8, and covers the relevant open family: Llama, Mistral, Qwen, DeepSeek, Phi, and Gemma.
Installation and first boot
The minimum entry point is a single command that brings up an HTTP server with the OpenAI API ready to consume from any existing client:
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --quantization awq

The endpoint sits at http://localhost:8000/v1, and any SDK that speaks the OpenAI API works by changing only the base URL and passing a dummy key. That is what accelerates adoption most: application code does not change.
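As a minimal sketch of that migration, this is roughly what pointing the official OpenAI Python SDK at the local server looks like; the model name matches the checkpoint passed at launch, and the key is a placeholder because the server above does not require one.

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    max_tokens=128,
    temperature=0.7,
)
print(response.choices[0].message.content)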
The four parameters that move the needle most:
- tensor-parallel-size: how many GPUs shard the model; match this to physical cards per node.
- gpu-memory-utilization: between 90% and 95%; pushing higher leaves no headroom for spikes.
- max-model-len: conditions how much KV cache you reserve; expanding it when real traffic uses short sequences is giving away memory.
- quantization: with AWQ or GPTQ, almost always a net win on memory and speed.
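The same knobs appear as constructor arguments in vLLM's offline Python API, which is convenient for batch jobs that do not need an HTTP server. A minimal sketch, assuming the argument names of recent vLLM releases:

from vllm import LLM, SamplingParams

# The server flags above, as constructor arguments of the offline engine.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,
    max_model_len=32768,
    quantization="awq",  # assumes the checkpoint is already AWQ-quantized
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarise the benefits of continuous batching."], params)
print(outputs[0].outputs[0].text)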
Quantization and parallelism in practice
AWQ offers the best quality-memory trade-off for most Llama and Mistral models and the weights come pre-quantized on Hugging Face, making startup immediate. GPTQ is equivalent in spirit but with a different format. FP8 is only interesting on H100; on A100 it falls back to slow paths. INT4 compresses a lot but starts degrading reasoning on long chains, which does not always show up in short benchmarks.
Tensor parallelism becomes mandatory once the model stops fitting on a single GPU: a Llama 3.1 70B in FP16 needs four 80 GB A100s, while with AWQ it fits on two. Pipeline parallelism only pays off once you have exhausted tensor parallelism within a node and need to cross into another server; inter-node latency penalizes first-token time heavily.
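The sizing claim is just memory arithmetic. A rough sketch, with the AWQ overhead factor and the 95% utilization taken as assumptions:

# Rough memory math behind the 70B sizing claim; all figures are approximations.
params_b = 70
fp16_gb = params_b * 2            # ~140 GB of weights at 2 bytes per parameter
awq_gb = params_b * 0.5 * 1.2     # ~4-bit weights plus ~20% overhead, ~42 GB

for label, weights, gpus in [("FP16 on 4x A100", fp16_gb, 4),
                             ("FP16 on 2x A100", fp16_gb, 2),
                             ("AWQ  on 2x A100", awq_gb, 2)]:
    budget = gpus * 80 * 0.95     # usable HBM at 95% utilization
    print(f"{label}: {weights:.0f} GB weights, {budget - weights:.0f} GB left for KV cache")
# FP16 on two cards leaves ~12 GB of cache for all requests combined,
# which is why the practical floor for FP16 is four cards.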
On concrete performance, a Llama 3 8B on a single 80 GB A100 gives around 60-80 tokens per second on isolated requests, but with fifty concurrent connections the aggregate jumps to 2,000-3,000 tokens per second. That jump is the gift from PagedAttention and continuous batching.
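You can reproduce that aggregation effect with a small load generator against the OpenAI-compatible endpoint. A minimal asyncio sketch; the prompt, concurrency, and token budget are arbitrary choices:

import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 50) -> None:
    start = time.perf_counter()
    prompts = ["Write a short paragraph about paged memory."] * concurrency
    tokens = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} completion tokens in {elapsed:.1f}s "
          f"-> {sum(tokens) / elapsed:.0f} tokens/s aggregate")

asyncio.run(main())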

Observability and advanced features
vLLM exposes Prometheus metrics on the /metrics endpoint of the API server. The ones to watch at all times:
- Requests in flight and in queue.
- GPU KV cache occupancy.
- Time to first token.
- End-to-end latency.
The official Grafana dashboard covers them. If the queue grows steadily, the problem is not the model but capacity: add replicas or switch to a more powerful card.
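For a quick look at those gauges without a full Grafana setup, scraping the Prometheus endpoint directly is enough. A minimal sketch; the metric names below match recent vLLM releases but can shift between versions:

import requests

# Scrape vLLM's Prometheus endpoint and print the gauges worth watching.
WATCH = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith(WATCH):
        print(line)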
Three advanced features deserve special mention. Multi-LoRA serves several adapters on the same base model and switches per request — gold when you have several small fine-tunes. Speculative decoding uses a small draft model that proposes tokens and the main model verifies them, with realistic 2x-3x speedups. And structured output — via Outlines integration — guarantees JSON that is valid against a schema, eliminating an entire class of fragile parsers. For your LLM observability stack it is worth connecting vLLM metrics to specialised tools like Langfuse.
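As a taste of the structured-output path, vLLM's OpenAI-compatible server accepts guided-decoding options through the request's extra body. A hedged sketch, assuming the guided_json extension of recent releases and an arbitrary illustrative schema:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Arbitrary illustrative schema: force the answer into a fixed JSON shape.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Classify: 'the rollout went smoothly'"}],
    extra_body={"guided_json": schema},  # vLLM-specific extension to the OpenAI API
)
print(response.choices[0].message.content)  # parses against the schema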
vLLM versus TGI and SGLang
Hugging Face’s Text Generation Inference keeps solid engineering and impeccable HF integration, but its licence change complicates some commercial deployments and on pure throughput it sits slightly behind. SGLang is strong at shared-prefix workloads but its community is still small. LMDeploy shines with the Intern family and aggressive quantization, but loses steam outside that niche.
vLLM occupies the centre of gravity: it wins on general throughput, keeps an Apache 2.0 licence, and receives improvements nearly every week. For maximum performance on NVIDIA hardware, TensorRT-LLM can add 2x-3x more at the cost of much greater build complexity.
When self-hosting pays
If your load reaches several million tokens per day and you control the hardware, self-hosting with vLLM beats a commercial API sooner than most estimates suggest. The break-even at typical prices sits around ten million tokens per day for a mid-sized model, once you count engineering hours and power. Above that the savings turn aggressive and sovereignty over the model starts to carry strategic weight.
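To make the break-even reasoning explicit, here is a toy calculator. Every rate in it is a placeholder assumption rather than a quoted price; substitute your own API contract, GPU cost, and operations overhead:

# Toy break-even calculator; every rate is a placeholder, not a quoted price.
usd_per_million_tokens = 7.0    # assumed blended commercial API price
gpus = 1                        # assumed: mid-sized model, AWQ-quantized, one 80 GB card
usd_per_gpu_hour = 1.8          # assumed on-demand rate
engineering_usd_per_day = 30.0  # assumed slice of operations time

self_hosted_per_day = gpus * usd_per_gpu_hour * 24 + engineering_usd_per_day
break_even_tokens = self_hosted_per_day / usd_per_million_tokens * 1e6

print(f"Self-hosted floor: ${self_hosted_per_day:.0f}/day")
print(f"Break-even volume: ~{break_even_tokens / 1e6:.0f}M tokens/day")
# With these placeholder rates the break-even lands near 10M tokens/day; above it,
# API spend keeps scaling with volume while the GPU bill stays roughly flat.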
The main mistake is treating vLLM as a fire-and-forget binary. It requires judgement to size maximum context, pick quantization, decide horizontal replicas, and calibrate memory utilization. It also asks you to accept honest limits: it only runs well on NVIDIA GPUs (ROCm support is still experimental), startup loads gigabytes of weights and can take minutes, and less popular models get less-optimized kernels.
Wire metrics from day one, watch queue and KV cache, and scale by replicas when the queue grows. That covers ninety percent of real-world cases. If you need efficient fine-tuning on those models before serving them, LoRA and QLoRA are the natural complement for reducing adaptation cost.
Conclusion
vLLM is the most pragmatic option for serving open LLMs on GPU: it combines the highest available throughput under Apache 2.0, an OpenAI-compatible API that removes migration friction, and a community that delivers improvements weekly. The performance jump from PagedAttention and continuous batching turns the same hardware into infrastructure an order of magnitude more efficient under real load.