vLLM: Serving LLMs in Production with Very High Throughput

vLLM serves language models on GPU using PagedAttention and continuous batching, two techniques that multiply throughput compared with a naive server. It exposes an OpenAI-compatible API, so migrating an existing application only requires changing the base URL and deploying the right binary.

October 5, 2024 7 min 304 4.5

Artificial Intelligence

SGLang: Fine Control Over LLM Execution

SGLang adds a Python DSL for controlling LLM generation with constrained decoding, parallel branching, and RadixAttention, the structure that indexes the KV cache as a radix trie to reuse shared prefixes across requests. When that pattern exists, speedups over vLLM reach up to 5 times; without it, the advantage shrinks.

June 10, 2024 3 min 198 4.4

Artificial Intelligence

Hugging Face TGI: Serving Open Models at Scale

Text Generation Inference (TGI) is the Hugging Face stack for serving open LLMs in production: continuous batching, 4-bit and 8-bit quantization, streaming, and an OpenAI-compatible API. After a brief restrictive-licence episode in 2023, it returned to Apache 2.0 in version 2.0.

January 3, 2024 4 min 281 4.4