vLLM has established itself as the most widely adopted LLM serving engine in production. A review of recent improvements, what changes for operators, and what remains a weak point.
Read more
ONNX Runtime at the Edge: Portable, Fast Inference
One model, many targets. ONNX Runtime solves ML runtime fragmentation at the cost of a little peak performance on each individual platform.
Read more
Hugging Face TGI: Serving Open Models at Scale
Text Generation Inference is Hugging Face’s serving stack for LLMs. When it makes sense, which optimisations you get for free, and where its real limits are.
Read more