Hugging Face TGI: Serving Open Models at Scale
Updated: 2026-05-03
Serving an open LLM in production is not trivial: keeping the GPU saturated, batching requests without blocking, applying quantization without losing quality, and exposing an OpenAI-compatible API means solving several problems at once. Text Generation Inference (TGI)[1] from Hugging Face tackles them with a single, coherent stack. It is one of the more mature options in the open ecosystem, though with a licence history that deserves attention.
Key takeaways
- TGI supports continuous batching, FlashAttention v2, tensor parallelism, and quantization (bitsandbytes, GPTQ, AWQ) with minimal configuration.
- For Hugging Face Hub models, TGI is the lowest-friction path — load by ID, no format conversion.
- TGI's licence has changed over time: the 1.x releases used the restrictive HFOIL licence before 2.0 returned to Apache 2.0, so evaluate the terms of the version you deploy.
- For raw throughput on high-end GPUs, vLLM usually wins. For CPU or Apple Silicon, llama.cpp is better.
- The OpenAI-compatible API (/v1/chat/completions) enables hosted-to-self-hosted transitions without code changes.
What TGI Is
TGI is an inference server specialised in text generation, written mostly in Rust (the router) and Python (the model worker, on PyTorch). It supports the most popular Hub models: Llama 2, Mistral, Falcon, StarCoder, BLOOM, CodeLlama, Yi, MPT, Phi, and variants.
Key capabilities:
- Continuous batching: groups concurrent requests without waiting for the slowest. Maximises GPU utilisation.
- FlashAttention v2: optimised attention implementation reducing memory and accelerating computation.
- PagedAttention: KV-cache memory management inspired by vLLM (though vLLM implements it better).
- Tensor parallelism: split the model across multiple GPUs.
- Quantization: bitsandbytes (NF4, FP4, INT8), GPTQ, AWQ, EETQ. 4-bit or 8-bit inference with minimal effort.
- SSE streaming: tokens sent to client as generated.
- Guided generation: grammars, regex, and JSON schema via Outlines (a request sketch follows this list).
- OpenAI-compatible API: /v1/chat/completions as an additional layer.
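As a concrete illustration of guided generation, here is a sketch of a /generate request constrained to a JSON schema via TGI's grammar parameter. The prompt and schema are invented for this example, and the exact grammar field layout should be checked against the TGI version you run.

```bash
# Guided generation sketch: constrain output to a JSON schema.
# Prompt and schema are illustrative; verify the grammar format for your TGI version.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Extract the city and country: Berlin is the capital of Germany.",
    "parameters": {
      "max_new_tokens": 64,
      "grammar": {
        "type": "json",
        "value": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "country": {"type": "string"}
          },
          "required": ["city", "country"]
        }
      }
    }
  }'
```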
Minimal Deployment
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --quantize bitsandbytes-nf4
```
After a few minutes of loading, the endpoint responds at POST /generate with streamed tokens. On Kubernetes, the official chart covers deployments with the GPU operator; pods need nvidia.com/gpu: 1 (or more for tensor parallelism).
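Once the weights have loaded, a quick smoke test against the container above exercises both paths: /generate for a full completion and /generate_stream for SSE tokens.

```bash
# Non-streamed completion:
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is continuous batching?", "parameters": {"max_new_tokens": 50}}'

# Same request streamed token by token over SSE (-N disables curl output buffering):
curl -N http://localhost:8080/generate_stream \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is continuous batching?", "parameters": {"max_new_tokens": 50}}'
```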
Where TGI Shines
TGI is the right option for:
- Direct Hugging Face Hub models: load by ID, no format conversion.
- bitsandbytes-quantized models: native NF4, FP4, INT8 support.
- High-quality SSE streaming: low first-token latency, stable throughput.
- Hugging Face Inference Endpoints: TGI is the engine behind their managed service.
- Transformers integration: same tool family, proven compatibility.
If you’re already in the HF ecosystem, TGI is the lowest-friction path.
Where Others Surpass
- Raw throughput: vLLM[2] squeezes more tokens/second on high-end GPUs thanks to its well-implemented PagedAttention, and is Apache 2.0.
- CPU or Apple Silicon: llama.cpp[3] / Ollama[4] are better for GPU-less inference.
- Exotic models: TGI covers the popular ones; less common or very new models may be unsupported.
- Licence: TGI 1.x switched to the restrictive HFOIL licence for commercial serving, and companies that had assumed Apache-licensed code re-evaluated (some migrated to vLLM over it). TGI 2.0 returned to Apache 2.0, so check the licence of the exact version you deploy.
High-Impact Optimisations
Three small tunings with a large effect (a combined example follows the list):
- --max-batch-prefill-tokens: total token cap in the prefill phase (the costliest). Higher means more concurrency but more VRAM.
- --max-total-tokens: maximum context window per request. Tighter means less memory use.
- --quantize gptq or --quantize awq: better than bitsandbytes if you have a pre-quantized model.
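As a sketch, the same deployment from earlier with the two token caps set explicitly; the numbers are illustrative, not recommendations, so size them to your GPU and traffic.

```bash
# Illustrative values only: size both caps to your VRAM and traffic profile.
# Swap --quantize bitsandbytes-nf4 for gptq/awq if you serve a pre-quantized checkpoint.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --quantize bitsandbytes-nf4 \
  --max-batch-prefill-tokens 4096 \
  --max-total-tokens 4096
```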
Measure throughput before and after with locust[5] or vegeta[6] to validate real impact.
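A minimal vegeta run against /generate, assuming the deployment above, might look like the sketch below; the rate and duration are placeholders to adjust to your expected traffic.

```bash
# Load-test sketch: 10 req/s for 60 s against /generate; compare 'vegeta report'
# output before and after changing the flags above.
echo '{"inputs": "Hello", "parameters": {"max_new_tokens": 64}}' > body.json
echo "POST http://localhost:8080/generate" | \
  vegeta attack -rate=10 -duration=60s \
    -header "Content-Type: application/json" \
    -body body.json | \
  vegeta report
```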
OpenAI-Compatible API
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```
This lets you use LangChain[7], LiteLLM[8], and OpenAI SDKs without code changes. The transition from a hosted model to self-hosted goes from weeks to minutes.
Production Operation
Checklist for serious TGI ops:
- Health checks against /health.
- Prometheus metrics at /metrics: exposes latency, throughput, and VRAM usage (curl sketches for both endpoints follow this checklist).
- Limit concurrency at the router level to avoid OOM during spikes.
- Backup the downloaded model: if the hub is unreachable you can’t start.
- Monitor GPU temperature — GPUs at 90°C degrade and fail.
- Plan for CUDA/driver updates — TGI is version-sensitive.
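For the first two checklist items the probes are plain HTTP; a sketch assuming the default port mapping from the deployment above (metric names can vary between TGI versions):

```bash
# Liveness/readiness probe target (returns 200 when the model is ready):
curl -fsS http://localhost:8080/health

# Prometheus scrape target; filter for TGI's request and queue series:
curl -s http://localhost:8080/metrics | grep -E "^tgi_(request|queue)" | head
```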
Alternatives to Consider
- vLLM[2]: better general throughput, very active community, Apache 2.0.
- llama.cpp[3] / Ollama[4]: CPU and Apple Silicon, simpler deployment.
- TensorRT-LLM[9]: performance ceiling on NVIDIA GPUs, but high operational complexity.
- LMDeploy[10]: very good performance on certain models.
Conclusion
TGI remains a robust, sensible choice for most teams serving open models: support for popular Hub models, easy quantization, a familiar API, and HF-ecosystem integration. The licence history (HFOIL in 1.x, Apache 2.0 again from 2.0) is still worth verifying for the version you adopt in commercial contexts. For absolute throughput on top GPUs, vLLM usually wins; for maximum simplicity on CPU, llama.cpp. If you have no licence constraints and are in the HF ecosystem, TGI remains the lowest-friction option.