Artificial Intelligence

vLLM in 2025: the improvements that matter to LLM-serving teams

Updated: 2026-05-03

Two years ago, serving language models in production was an exercise in fragmentation. Every team hitting the problem ended up choosing among a dozen options, and the decision was rarely final. Today, while serious alternatives persist, vLLM has become the default engine for most teams serving models on GPU. And the growth isn’t accidental: it’s the result of a very consistent improvement pace over two years.

This post reviews the most important vLLM changes of the past several months and frames them in terms of what they mean for operators.

Key takeaways

  • Automatic prefix caching is the most impactful improvement: latencies in the 500-800 ms range have dropped to 150-200 ms for workloads with repeated prompts, without changing anything in the application.
  • Speculative decoding reduces latency especially for 70B+ models, but adds operational complexity.
  • Multi-LoRA support transforms the economics of multi-tenant services: one shared base model + per-customer adapters.
  • Multi-GPU support remains more fragile than single-GPU for some large models.
  • Non-NVIDIA hardware (AMD ROCm, Intel Habana) lags in experience and maturity.

The maturity moment

vLLM started as an academic project focused on PagedAttention (efficient KV-cache memory management during inference) and has grown into a platform with a predictable release cycle, a stable API, and an ecosystem of enterprise users. The consolidation shows in the details: better documentation, more honest published benchmarks, and first-class integration with orchestration frameworks (Ray, Kubernetes).

The most technically relevant point: PagedAttention’s original throughput advantage was notable but not enormous. Today, vLLM has several accumulated advantages over equivalent alternatives that add up to 2x or 3x throughput for typical workloads. That translates directly into infrastructure cost.

Prefix caching: what’s changed most

The most impactful improvement is automatic prefix caching. When many requests share a prefix (the typical case: an application’s system prompt, or the shared context of a RAG pipeline), vLLM detects the overlap and reuses the already-computed attention cache. Latencies in the 500-800 ms range have dropped to 150-200 ms on the same hardware and model, simply by enabling prefix caching. And aggregate throughput scales proportionally.

Integration is frictionless: nothing to configure, works automatically. For a team already using vLLM, moving to the prefix-caching version is just an upgrade.
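
As a rough sketch of what that looks like with the offline LLM API (the model name is a placeholder; on recent releases prefix caching may already be on by default, so the explicit flag mainly matters for older versions):

    from vllm import LLM, SamplingParams

    # Explicit flag; harmless on versions where prefix caching is already the default.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any model you already serve
        enable_prefix_caching=True,
    )

    # Both requests share the long system prompt, so the second one reuses
    # the KV cache computed for the first instead of re-running the full prefill.
    system_prompt = (
        "You are the support assistant for ExampleCorp. Answer concisely "
        "and point to the relevant help-center article when possible.\n"
    )
    prompts = [
        system_prompt + "User: How do I reset my password?",
        system_prompt + "User: Where can I download my invoices?",
    ]

    outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
    for out in outputs:
        print(out.outputs[0].text)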

Speculative decoding: the second big improvement

The technique uses a small, fast draft model to predict several tokens ahead and then verifies them with the main model. When the predictions are correct, the big model validates in a single pass what would otherwise have required several, and effective latency drops.

vLLM has incorporated speculative decoding with several draft-model options. Latency improvement is especially noticeable in large models (70B+). For interactive workloads where user experience depends on time-to-first-token, it’s a qualitative change.

The trade-off is that speculative decoding adds operational complexity: you need to deploy the draft model alongside the main one.
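
For reference, this is roughly how the draft model gets wired up in the offline API. The exact argument names have shifted between releases (older versions took the draft model as a direct argument, newer ones take a speculative_config dict), so check the docs for the version you run; model names and token counts here are purely illustrative:

    from vllm import LLM, SamplingParams

    # Main model plus a small draft model that proposes tokens ahead of it.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",        # illustrative main model
        tensor_parallel_size=4,                           # a 70B model typically spans several GPUs
        speculative_config={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # illustrative draft model
            "num_speculative_tokens": 5,                  # tokens proposed per verification step
        },
    )

    outputs = llm.generate(
        "Explain the difference between prefill and decode in one paragraph.",
        SamplingParams(max_tokens=256),
    )
    print(outputs[0].outputs[0].text)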

Multi-LoRA: a specific case

For teams serving multiple fine-tunes of the same base model (typical in multi-tenant SaaS where each customer has their adapter), vLLM has matured multi-LoRA support significantly. You can load a base model and hundreds of LoRA adapters, and inference routes to the correct adapter per request without reloading the model.

This transforms the economics of multi-tenant LLM services. Instead of deploying a model per customer, you deploy a shared base model and an adapter per customer. Adapters are small (a few MB each), so you can keep hundreds active in GPU memory simultaneously.
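
A minimal sketch of that routing with the offline API (adapter names and paths are placeholders; the HTTP server exposes the same capability via --enable-lora and per-request model names):

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # One shared base model, LoRA support switched on.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
        enable_lora=True,
        max_loras=8,        # adapters that can be active in a single batch
        max_lora_rank=16,   # must cover the rank your adapters were trained with
    )

    params = SamplingParams(max_tokens=128)

    # Each request names its adapter; vLLM applies it without reloading the base model.
    out_a = llm.generate(
        "Summarize this support ticket: ...",
        params,
        lora_request=LoRARequest("customer-a", 1, "/adapters/customer-a"),
    )
    out_b = llm.generate(
        "Summarize this support ticket: ...",
        params,
        lora_request=LoRARequest("customer-b", 2, "/adapters/customer-b"),
    )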

Comparison with alternatives

The serious alternatives remain Hugging Face’s TGI and NVIDIA’s TensorRT-LLM.

TGI has improved a lot and now offers features comparable to vLLM in most areas. It’s a good option if you’re already invested in the Hugging Face ecosystem.

TensorRT-LLM offers the highest throughput on NVIDIA hardware when you can dedicate time to model-specific optimization. The price is a more complex compilation pipeline. For high-volume services with predictable workloads it’s worth considering; for services with variable workloads or frequent model changes, vLLM is more comfortable.

llama.cpp and derivatives (Ollama, LM Studio) don’t compete on throughput but on simplicity. Excellent for prototypes and local applications; for services handling tens or hundreds of concurrent requests, vLLM is superior by design.

What remains a weak point

  • Multi-GPU support has improved but remains more fragile than single-GPU.
  • Non-NVIDIA hardware support lags. vLLM works on AMD with ROCm, but the experience is clearly inferior to NVIDIA.
  • Memory consumption during startup is high. vLLM loads the model and pre-allocates KV-cache buffers aggressively. For large models on GPUs with limited VRAM, fitting everything can be difficult.
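
On that last point, the two knobs that most affect the startup footprint are the fraction of VRAM vLLM is allowed to pre-allocate and the maximum context length it sizes the KV cache for. The values below are illustrative, not recommendations:

    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
        gpu_memory_utilization=0.85,  # cap the fraction of VRAM vLLM claims at startup
        max_model_len=8192,           # shorter max context -> smaller KV-cache reservation
    )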

What it means for operators

For practically any team serving LLMs on NVIDIA GPU with non-trivial workloads, vLLM is the option with the best return on time investment. Recent improvements (prefix caching, speculative decoding, multi-LoRA) have widened the lead over alternatives.

My recommendation to a team starting today:

  1. Start on the latest stable version.
  2. Measure with your real workload before micro-optimizing (see the sketch after this list).
  3. Enable prefix caching from the start if your prompts have repeated parts.
  4. Consider speculative decoding only if measurements show latency is a real issue.
  5. Don’t try to tune all knobs at once.
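
For point 2, something as simple as replaying a file of production-like prompts and counting output tokens per second is enough to establish a baseline (the file name and model are placeholders):

    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder

    # One production-like prompt per line.
    with open("real_prompts.txt") as f:
        prompts = [line.strip() for line in f if line.strip()]

    params = SamplingParams(max_tokens=256)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{len(prompts)} requests in {elapsed:.1f} s "
          f"({generated / elapsed:.0f} output tokens/s)")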

Medium-term, I expect vLLM to maintain its improvement pace and become taken-for-granted infrastructure, the way Redis or PostgreSQL are in their respective niches today. For people building products on LLMs, that stability is good news: less time on infrastructure, more on the product.

