vLLM has become the reference engine for serving LLMs on GPUs, built around PagedAttention, continuous batching, and an OpenAI-compatible API. How to deploy it well, and when it is worth it.
SGLang: Fine Control Over LLM Execution
SGLang adds a DSL for controlling LLM generation with constrained decoding, branching, and prefix caching. When it beats vLLM and why RadixAttention changes the arithmetic.
Hugging Face TGI: Serving Open Models at Scale
Text Generation Inference is Hugging Face’s serving stack for open LLMs. When it makes sense, which optimisations you get for free, and where its real limits are.