Two years ago, serving language models in production was an exercise in fragmentation. Every team hitting the problem ended up choosing among a dozen options (Hugging Face Text Generation Inference, DeepSpeed-MII, FasterTransformer, llama.cpp servers, homegrown PyTorch setups), and the decision was rarely final. Today, while serious alternatives persist, vLLM has become the default engine for most teams serving models on GPU. And the growth isn’t accidental: it’s the result of a very consistent improvement pace over two years.
This post reviews the important vLLM changes of the last several months and frames them in terms of what they mean for operators. Not a feature list, but a reading of which problems the improvements solve and which remain open.
The maturity moment
vLLM started as an academic project focused on a specific idea (PagedAttention, efficient KV memory management during inference) and has grown into a platform with a predictable release cycle, a stable API, and an enterprise user ecosystem. The consolidation shows in the details: the documentation is better than it was a year ago, the project’s published benchmarks are more honest, and integration with orchestration frameworks (Ray, Kubernetes) is first-class.
But what matters most is technical: PagedAttention’s original throughput advantage was notable but not overwhelming. Today, vLLM has accumulated several advantages over equivalent alternatives that add up to a 2x or 3x factor in throughput for typical workloads. That translates directly into infrastructure cost.
Prefix caching: what’s changed most
The most impactful improvement of the last year is automatic prefix caching. When many requests share a prefix (the typical case: an application’s system prompt, or the shared context in a RAG setup), vLLM now detects the overlap and reuses the already-computed attention cache. The practical effect is that workloads that previously executed the same prompt head thousands of times a day now do it once per node.
For applications with lots of repetition (assistants with long fixed instructions, RAG with recurring context, chat with history shared across turns), the savings in latency and cost are very real. In cases I’ve seen measured, latencies in the 500-800 ms range have dropped to 150-200 ms on the same hardware and model, simply by enabling prefix caching. And aggregate throughput scales proportionally.
Integration is frictionless: there is nothing to configure, and in recent versions it works automatically. For a team already on vLLM, adopting prefix caching is just an upgrade.
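The mechanism can be illustrated with a toy sketch (this is not vLLM’s actual implementation, which hashes fixed-size token blocks and stores real KV tensors): cached blocks are keyed by a hash of the entire prefix up to that block, so a block is only reused when everything before it also matches. All names here are illustrative.

```python
import hashlib

class ToyPrefixCache:
    """Toy block-level prefix cache: maps prefix hashes to 'computed' blocks.

    Illustrative only -- in a real engine a cached block holds KV tensors;
    here it is just the token slice itself.
    """

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}          # prefix hash -> cached block
        self.computed_blocks = 0  # cache misses = real attention work done

    def _key(self, tokens):
        # The hash covers the *whole* prefix ending at this block, so reuse
        # requires everything before the block to match too.
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def process_prompt(self, tokens):
        reused = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = self._key(tokens[:end])
            if key in self.blocks:
                reused += 1  # cache hit: skip recomputation
            else:
                self.blocks[key] = tokens[end - self.block_size:end]
                self.computed_blocks += 1
        return reused

cache = ToyPrefixCache(block_size=4)
system = [1, 2, 3, 4, 5, 6, 7, 8]                        # shared system prompt (2 blocks)
cache.process_prompt(system + [9, 10, 11, 12])           # first request: all misses
hits = cache.process_prompt(system + [13, 14, 15, 16])   # second request reuses 2 blocks
print(hits)  # -> 2
```

The second request only pays for the block that differs, which is exactly why workloads with long shared system prompts see the largest latency drops.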
Speculative decoding: the second big improvement
The technique uses a small, fast model to predict several tokens ahead and then verifies those predictions with the main model. When the predictions are correct, the big model validates in a single forward pass what would otherwise have taken several, and effective latency drops.
vLLM has incorporated speculative decoding with several draft-model options. Latency improvement is especially noticeable in large models (70B+) where each individual token is expensive. For interactive workloads where user experience depends on time-to-first-token and generation speed, it’s a qualitative change.
The only consideration is that speculative decoding adds operational complexity: you need to deploy the draft model alongside the main one, and tune the acceptance ratio for your specific workload. For most standard deployments, the default ratio works, but atypical cases are worth measuring.
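The draft-then-verify loop can be sketched in a few lines. This is a greedy-verification toy (production implementations verify probabilistically via rejection sampling, and the target really does check all proposals in one batched forward pass); the draft and target “models” are plain functions, and all names are illustrative.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of draft-then-verify speculative decoding (greedy variant).

    target_next / draft_next: callables mapping a token sequence to the next
    token. The draft proposes k tokens; the target checks them and keeps the
    longest correct prefix plus one token of its own.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model verifies the proposals (in a real engine, one batched
    #    forward pass): accept until the first mismatch.
    accepted = []
    ctx = list(context)
    for t in proposed:
        correct = target_next(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)  # target's own token replaces the miss
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all k accepted
    return accepted

# Toy models: the target counts up from the last token; the draft agrees
# with it except when the last token is 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] != 3 else 99

out = speculative_step(target, draft, context=[0], k=4)
print(out)  # -> [1, 2, 3, 4]: three accepted drafts plus the target's fix
```

The payoff is visible even in the toy: one verification round emits four tokens instead of one, and when the draft agrees perfectly it emits k+1. The acceptance rate is exactly what determines whether the extra draft model pays for itself.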
Multi-LoRA: a specific case
For teams serving multiple fine-tunes of the same base model (typical in multi-tenant SaaS where each customer has their adapter), vLLM has matured multi-LoRA support significantly. You can load a base model and hundreds of LoRA adapters, and inference routes to the correct adapter per request without reloading the model.
This transforms the economics of multi-tenant LLM services. Instead of deploying a model per customer (which doesn’t scale), you deploy a shared base model and an adapter per customer. Adapters are small (few MB), so you can have hundreds active in GPU memory simultaneously.
The typical application of this in 2025 is B2B SaaS with personalized AI: each customer trains their own adapter on their data, and the service serves all with one base model. vLLM is the piece that makes this operationally viable.
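The routing idea reduces to a sketch like the following. It is a deliberate simplification (real LoRA adds low-rank matrix products A·B to attention and MLP weights, and vLLM batches requests for different adapters together); weights here are scalars, and all names are illustrative. In vLLM itself, adapters are passed per request, but the structure is the same: one resident base model, a small per-tenant delta selected at request time.

```python
class ToyMultiLoRAServer:
    """Toy multi-LoRA routing: one shared base 'weight' plus per-tenant
    low-rank deltas applied at request time. Illustrative only -- real LoRA
    adds A @ B matrices to model weights; here everything is a scalar.
    """

    def __init__(self, base_weight):
        self.base = base_weight
        self.adapters = {}  # tenant id -> (a, b) low-rank factors

    def load_adapter(self, tenant, a, b):
        # Adapters are tiny relative to the base model, which is why
        # hundreds can stay resident in GPU memory at once.
        self.adapters[tenant] = (a, b)

    def forward(self, tenant, x):
        w = self.base
        if tenant in self.adapters:
            a, b = self.adapters[tenant]
            w = w + a * b  # LoRA: effective weight = W + A@B
        return w * x

server = ToyMultiLoRAServer(base_weight=2.0)
server.load_adapter("acme", a=0.5, b=1.0)      # effective weight 2.5
server.load_adapter("globex", a=-1.0, b=0.5)   # effective weight 1.5

print(server.forward("acme", 4.0))        # -> 10.0
print(server.forward("globex", 4.0))      # -> 6.0
print(server.forward("new-tenant", 4.0))  # no adapter -> base only, 8.0
```

The design point worth noticing is that the base weights are never copied per tenant: only the small deltas differ, which is what makes hundreds of active customers on one GPU economically plausible.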
Comparison with alternatives
The serious alternatives remain Hugging Face’s Text Generation Inference and NVIDIA’s TensorRT-LLM.
TGI has improved a lot over the last year and now has features comparable to vLLM in most areas. It’s a good option if you’re already integrated into the Hugging Face ecosystem and value consistency with their other tools. In pure throughput, independent benchmarks still give a slight edge to vLLM, but the gap is smaller than a year ago.
TensorRT-LLM offers the highest throughput on NVIDIA hardware when you can dedicate time to specific optimization. The price is a more complex compilation pipeline and less flexibility for dynamic workloads. For high-volume services with predictable workloads, it’s worth considering; for services with variable workloads or frequent model changes, vLLM is more comfortable.
There’s a third, lighter line represented by llama.cpp and derivatives like Ollama. They don’t compete with vLLM on throughput but on simplicity and flexibility. For prototypes, local applications, and small deployments, they remain excellent. For services handling tens or hundreds of concurrent requests, vLLM is superior by design.
What remains a weak point
vLLM isn’t perfect, and its limits should be clear.
Multi-GPU support has improved but remains more fragile than single-GPU. Tensor-parallelism configurations across multiple GPUs are well documented, but in practice problems can arise with certain large models requiring specific tuning.
Non-NVIDIA hardware support lags. vLLM works on AMD with ROCm and has been ported to Intel Habana, but the experience on those chips is clearly inferior to NVIDIA. For those on alternative hardware, the ecosystem isn’t mature yet.
And memory consumption during startup is high. vLLM loads the model and pre-allocates KV cache buffers aggressively, and for large models on GPUs with limited VRAM, fitting everything can be hard. There are configuration options for this (notably the fraction of GPU memory vLLM is allowed to reserve), but using them well demands understanding the model’s memory footprint.
What it means for operators
My conclusion after a year operating services with vLLM in production is concrete. For practically any team serving LLMs on NVIDIA GPU with non-trivial workloads, vLLM is the option with the best return on time investment. Recent improvements (prefix caching, speculative decoding, multi-LoRA) have widened the lead over alternatives over the last six months.
What I’d recommend a team starting today: start on the latest stable version, measure with your real workload before micro-optimizing, enable prefix caching from the start if your prompts have repeated parts, and consider speculative decoding only if measurements show latency is a real issue. Don’t try to tune all knobs at once; vLLM works well with default config for most cases.
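As a concrete starting point, a launch along these lines covers the recommendations above. Flag names are from recent vLLM releases and the model name is illustrative; check `vllm serve --help` for your version before copying.

```shell
# Serve an OpenAI-compatible endpoint with conservative, near-default settings.
# Prefix caching is enabled by default in recent versions; the flag is shown
# explicitly for older ones.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

From there, measure with your real traffic before touching anything else; the point of the defaults is that they are good.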
In the medium term, I expect vLLM to keep up its pace of improvement and become taken-for-granted infrastructure, the way Redis or PostgreSQL are in their respective niches today. For people building products on LLMs, that stability is good news: less time on infrastructure, more on the product.