Hugging Face TGI: Serving Open Models at Scale
Updated: 2026-05-03
Serving an open LLM in production is not trivial: keeping the GPU saturated, batching requests without blocking, applying quantization without losing quality, and exposing an OpenAI-compatible API means solving several problems at once. Text Generation Inference (TGI)[1] from Hugging Face tackles them with a single, coherent stack. It is one of the more mature options in the open ecosystem, though with a licence history that deserves attention.
Key takeaways
- TGI supports continuous batching, FlashAttention v2, tensor parallelism, and quantization (bitsandbytes, GPTQ, AWQ) with minimal configuration.
- For Hugging Face Hub models, TGI is the lowest-friction path — load by ID, no format conversion.
- TGI's licence has changed over time: the 1.x releases used the restrictive HFOIL licence before 2.0 returned to Apache 2.0, so evaluate the terms of the version you deploy.
- For raw throughput on high-end GPUs, vLLM usually wins. For CPU or Apple Silicon, llama.cpp is better.
- The OpenAI-compatible API (/v1/chat/completions) enables hosted-to-self-hosted transitions without code changes.
What TGI Is
TGI is an inference server specialised in text generation, written mostly in Rust (the router) and Python (the model worker, on PyTorch). It supports the most popular Hub models: Llama 2, Mistral, Falcon, StarCoder, BLOOM, CodeLlama, Yi, MPT, Phi, and variants.
Key capabilities:
- Continuous batching: groups concurrent requests without waiting for the slowest. Maximises GPU utilisation.
- FlashAttention v2: optimised attention implementation reducing memory and accelerating computation.
- PagedAttention: KV-cache memory management inspired by vLLM (though vLLM implements it better).
- Tensor parallelism: split the model across multiple GPUs.
- Quantization: bitsandbytes (NF4, FP4, INT8), GPTQ, AWQ, EETQ. 4-bit or 8-bit inference with minimal effort.
- SSE streaming: tokens sent to client as generated.
- Guided generation: grammars, regex, and JSON schema via Outlines (a request sketch follows this list).
- OpenAI-compatible API: /v1/chat/completions as an additional layer.
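As a concrete illustration of guided generation, here is a sketch of a /generate request constrained to a JSON schema via TGI's grammar parameter. The prompt and schema are invented for this example, and the exact grammar field layout should be checked against the TGI version you run.

```bash
# Guided generation sketch: constrain output to a JSON schema.
# Prompt and schema are illustrative; verify the grammar format for your TGI version.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Extract the city and country: Berlin is the capital of Germany.",
    "parameters": {
      "max_new_tokens": 64,
      "grammar": {
        "type": "json",
        "value": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "country": {"type": "string"}
          },
          "required": ["city", "country"]
        }
      }
    }
  }'
```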
Minimal Deployment
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --quantize bitsandbytes-nf4
```
After a few minutes of loading, the endpoint responds at POST /generate with streamed tokens. On Kubernetes, the official chart covers deployments with the GPU operator; pods need nvidia.com/gpu: 1 (or more for tensor parallelism).
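Once the weights have loaded, a quick smoke test against the container above exercises both paths: /generate for a full completion and /generate_stream for SSE tokens.

```bash
# Non-streamed completion:
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is continuous batching?", "parameters": {"max_new_tokens": 50}}'

# Same request streamed token by token over SSE (-N disables curl output buffering):
curl -N http://localhost:8080/generate_stream \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is continuous batching?", "parameters": {"max_new_tokens": 50}}'
```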
Where TGI Shines
TGI is the right option for:
- Direct Hugging Face Hub models: load by ID, no format conversion.
- bitsandbytes-quantized models: native NF4, FP4, INT8 support.
- High-quality SSE streaming: low first-token latency, stable throughput.
- Hugging Face Inference Endpoints: TGI is the engine behind their managed service.
- Transformers integration: same tool family, proven compatibility.
If you’re already in the HF ecosystem, TGI is the lowest-friction path.
Where Others Surpass
- Raw throughput: vLLM[2] squeezes more tokens/second on high-end GPUs thanks to its well-implemented PagedAttention, and is Apache 2.0.
- CPU or Apple Silicon: llama.cpp[3] / Ollama[4] are better for GPU-less inference.
- Exotic models: TGI covers the popular ones; less common or very new models may be unsupported.
- Licence: TGI 1.x switched to the restrictive HFOIL licence for commercial serving, and companies that had assumed Apache-licensed code re-evaluated (some migrated to vLLM over it). TGI 2.0 returned to Apache 2.0, so check the licence of the exact version you deploy.
High-Impact Optimisations
Three small tunings with a large effect (a combined example follows the list):
- --max-batch-prefill-tokens: total token cap in the prefill phase (the costliest). Higher means more concurrency but more VRAM.
- --max-total-tokens: maximum context window per request. Tighter means less memory use.
- --quantize gptq or --quantize awq: better than bitsandbytes if you have a pre-quantized model.
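As a sketch, the same deployment from earlier with the two token caps set explicitly; the numbers are illustrative, not recommendations, so size them to your GPU and traffic.

```bash
# Illustrative values only: size both caps to your VRAM and traffic profile.
# Swap --quantize bitsandbytes-nf4 for gptq/awq if you serve a pre-quantized checkpoint.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --quantize bitsandbytes-nf4 \
  --max-batch-prefill-tokens 4096 \
  --max-total-tokens 4096
```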
Measure throughput before and after with locust[5] or vegeta[6] to validate real impact.
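A minimal vegeta run against /generate, assuming the deployment above, might look like the sketch below; the rate and duration are placeholders to adjust to your expected traffic.

```bash
# Load-test sketch: 10 req/s for 60 s against /generate; compare 'vegeta report'
# output before and after changing the flags above.
echo '{"inputs": "Hello", "parameters": {"max_new_tokens": 64}}' > body.json
echo "POST http://localhost:8080/generate" | \
  vegeta attack -rate=10 -duration=60s \
    -header "Content-Type: application/json" \
    -body body.json | \
  vegeta report
```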
OpenAI-Compatible API
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```
This lets you use LangChain[7], LiteLLM[8], and OpenAI SDKs without code changes. The transition from a hosted model to self-hosted goes from weeks to minutes.
Production Operation
Checklist for serious TGI ops:
- Health checks against /health.
- Prometheus metrics at /metrics: exposes latency, throughput, and VRAM usage (curl sketches for both endpoints follow this checklist).
- Limit concurrency at the router level to avoid OOM during spikes.
- Backup the downloaded model: if the hub is unreachable you can’t start.
- Monitor GPU temperature — GPUs at 90°C degrade and fail.
- Plan for CUDA/driver updates — TGI is version-sensitive.
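For the first two checklist items the probes are plain HTTP; a sketch assuming the default port mapping from the deployment above (metric names can vary between TGI versions):

```bash
# Liveness/readiness probe target (returns 200 when the model is ready):
curl -fsS http://localhost:8080/health

# Prometheus scrape target; filter for TGI's request and queue series:
curl -s http://localhost:8080/metrics | grep -E "^tgi_(request|queue)" | head
```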
Alternatives to Consider
- vLLM[2]: better general throughput, very active community, Apache 2.0.
- llama.cpp[3] / Ollama[4]: CPU and Apple Silicon, simpler deployment.
- TensorRT-LLM[9]: performance ceiling on NVIDIA GPUs, but high operational complexity.
- LMDeploy[10]: very good performance on certain models.
Conclusion
TGI remains a robust, sensible choice for most teams serving open models: support for popular Hub models, easy quantization, a familiar API, and HF-ecosystem integration. The licence history (HFOIL in 1.x, Apache 2.0 again from 2.0) is still worth verifying for the version you adopt in commercial contexts. For absolute throughput on top GPUs, vLLM usually wins; for maximum simplicity on CPU, llama.cpp. If you have no licence constraints and are in the HF ecosystem, TGI remains the lowest-friction option.