vLLM: Serving LLMs in Production with Very High Throughput

vLLM serves language models on GPU using PagedAttention and continuous batching, two techniques that multiply throughput compared with a naive server. It exposes an OpenAI-compatible API, so migrating an existing application only requires changing the base URL and deploying the right binary.

October 5, 2024 7 min 304 4.5