vLLM has established itself as the most widely adopted LLM serving engine in production. A review of recent improvements, what changes for operators, and what remains a weak point.
Tag: gpu
TensorRT-LLM: NVIDIA GPU Acceleration for LLMs
TensorRT-LLM sets the performance ceiling on NVIDIA GPUs. Complex to operate, but 2-3x faster than vLLM in optimal cases.
vLLM: Serving LLMs in Production with Very High Throughput
vLLM has become the standard for serving LLMs on GPUs. PagedAttention, continuous batching, and an OpenAI-compatible API. How to deploy it well and when it is worth it.
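As a quick illustration of what "OpenAI-compatible" means here, the sketch below assumes a vLLM server is already running locally (for example via `vllm serve <model>` on the default port 8000); the model name and port are placeholders. Any standard OpenAI client can then talk to it by pointing `base_url` at the local endpoint.

```python
# Minimal sketch: querying a locally running vLLM server through its
# OpenAI-compatible endpoint. Assumes the server was started with something
# like `vllm serve mistralai/Mistral-7B-Instruct-v0.3` (default port 8000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed-locally",         # any string works unless auth is configured
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # must match the served model
    messages=[{"role": "user", "content": "Summarise what PagedAttention does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```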
Zed: A Modern Editor Built for Collaboration
Zed is the editor from Atom's creators, rewritten from scratch in Rust. When it is a serious VS Code alternative and what its collaboration features actually offer.
Hugging Face TGI: Serving Open Models at Scale
Text Generation Inference is Hugging Face's serving stack for LLMs. When it makes sense, which optimisations you get for free, and where the real limits are.