Artificial Intelligence

Hugging Face TGI: Serving Open Models at Scale

Updated: 2026-05-03

Serving an open LLM in production is not trivial: keeping the GPU saturated, batching requests without blocking, applying quantization without losing quality, and exposing an OpenAI-compatible API means solving several problems at once. Text Generation Inference (TGI)[1] from Hugging Face tries to resolve them with a coherent stack. It is one of the more mature options in the open ecosystem, though with a licence change that deserves attention.

Key takeaways

  • TGI supports continuous batching, FlashAttention v2, tensor parallelism, and quantization (bitsandbytes, GPTQ, AWQ) with minimal configuration.
  • For Hugging Face Hub models, TGI is the lowest-friction path — load by ID, no format conversion.
  • TGI 2.0+ changed to a restrictive commercial licence for production use: evaluate the terms before adopting.
  • For raw throughput on high-end GPUs, vLLM usually wins. For CPU or Apple Silicon, llama.cpp is better.
  • The OpenAI-compatible API (/v1/chat/completions) enables hosted-to-self-hosted transitions without code changes.

What TGI Is

TGI is an inference server specialised in text generation. It is written mostly in Rust (the router) and Python (the model worker, built on PyTorch), and supports the most popular Hub models: Llama 2, Mistral, Falcon, StarCoder, BLOOM, CodeLlama, Yi, MPT, Phi, and variants.

Key capabilities:

  • Continuous batching: groups concurrent requests without waiting for the slowest. Maximises GPU utilisation.
  • FlashAttention v2: optimised attention implementation reducing memory and accelerating computation.
  • PagedAttention: KV-cache memory management inspired by vLLM (though vLLM implements it better).
  • Tensor parallelism: split the model across multiple GPUs.
  • Quantization: bitsandbytes (NF4, FP4, INT8), GPTQ, AWQ, EETQ. 4-bit or 8-bit inference with minimal effort.
  • SSE streaming: tokens sent to client as generated.
  • Guided generation: grammars, regex, JSON schema via Outlines.
  • OpenAI-compatible API: /v1/chat/completions as an additional layer.
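As a sketch of guided generation, a request that constrains output to a JSON schema via the `grammar` parameter of `/generate` might look like this (the example prompt and schema are illustrative; the exact parameter shape may vary across TGI versions):

```shell
# Hypothetical example: constrain the model's output to a JSON schema
# using TGI's guided generation (grammar) support on /generate.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Extract the name and age: John is 30 years old.",
    "parameters": {
      "grammar": {
        "type": "json",
        "value": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"}
          },
          "required": ["name", "age"]
        }
      }
    }
  }'
```

The server then rejects any token that would break the schema, so the response parses as valid JSON without retry loops on the client side.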

Minimal Deployment

bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --quantize bitsandbytes-nf4

After a few minutes of loading, the endpoint responds at POST /generate with streamed tokens. On Kubernetes, the official chart covers deployments with the GPU operator; pods need nvidia.com/gpu: 1 (or more for tensor parallelism).
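Once the container is up, a quick smoke test against /generate (assuming the port mapping above) confirms the server is answering:

```shell
# Smoke test: non-streaming request to /generate on the mapped port.
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What is continuous batching?",
    "parameters": {"max_new_tokens": 64}
  }'
```

A JSON body with a `generated_text` field indicates the model loaded correctly; an error here usually means the model is still downloading or VRAM ran out during load.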


Where TGI Shines

TGI is the right option for:

  • Direct Hugging Face Hub models: load by ID, no format conversion.
  • bitsandbytes-quantized models: native NF4, FP4, INT8 support.
  • High-quality SSE streaming: low first-token latency, stable throughput.
  • Hugging Face Inference Endpoints: TGI is the engine behind their managed service.
  • Transformers integration: same tool family, proven compatibility.

If you’re already in the HF ecosystem, TGI is the lowest-friction path.

Where Others Do Better

  • Raw throughput: vLLM[2] squeezes more tokens/second on high-end GPUs thanks to its well-implemented PagedAttention, and is Apache 2.0.
  • CPU or Apple Silicon: llama.cpp[3] / Ollama[4] are better for GPU-less inference.
  • Exotic models: TGI covers the popular ones; less common or very new models may be unsupported.
  • Licence: TGI 2.0+ changed to a restrictive commercial licence for production use. Companies that assumed Apache code are re-evaluating — some have migrated to vLLM because of it.

High-Impact Optimisations

Three small tunings with large effect:

  • --max-batch-prefill-tokens: total token cap in prefill phase (the costliest). Higher means more concurrency but more VRAM.
  • --max-total-tokens: maximum context window per request. Tighter means less memory use.
  • --quantize gptq or --quantize awq: better than bitsandbytes if you have a pre-quantized model.
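Combined, a tuned launch might look like the following sketch. The numeric values and the pre-quantized model ID are illustrative assumptions, not recommendations; size them to your model and available VRAM:

```shell
# Illustrative tuned launch: AWQ pre-quantized model with explicit
# prefill and context caps (values are examples, tune for your hardware).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantize awq \
  --max-batch-prefill-tokens 4096 \
  --max-total-tokens 4096
```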

Measure throughput before and after with locust[5] or vegeta[6] to validate real impact.
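A minimal vegeta run against /generate could look like this (the rate and duration are arbitrary; adjust them to your expected production load):

```shell
# Hypothetical load test with vegeta: fixed request body, 5 req/s for 30s.
cat > body.json <<'EOF'
{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}
EOF

echo "POST http://localhost:8080/generate" | \
  vegeta attack -rate=5 -duration=30s \
    -header "Content-Type: application/json" \
    -body body.json | \
  vegeta report
```

Run it once before and once after changing a flag; the report's latency percentiles and success rate tell you whether the tuning actually helped.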

OpenAI-Compatible API

bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

This lets you use LangChain[7], LiteLLM[8], and OpenAI SDKs without code changes. The transition from a hosted model to self-hosted goes from weeks to minutes.

Production Operation

Checklist for serious TGI ops:

  • Health checks against /health.
  • Prometheus metrics at /metrics — exposes latency, throughput, and VRAM usage.
  • Limit concurrency at router level to avoid OOM in spikes.
  • Backup the downloaded model: if the hub is unreachable you can’t start.
  • Monitor GPU temperature — GPUs at 90°C degrade and fail.
  • Plan for CUDA/driver updates — TGI is version-sensitive.
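The first two checklist items can be wired into probes or cron checks with plain curl (metric names vary across TGI versions; inspect the raw /metrics output for the exact series your build exposes):

```shell
# Liveness: /health returns 200 once the model is loaded and serving.
curl -fsS http://localhost:8080/health && echo "healthy"

# Prometheus scrape target: filter for TGI-specific series.
curl -fsS http://localhost:8080/metrics | grep '^tgi_' | head
```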

Alternatives to Consider

  • vLLM[2]: better general throughput, very active community, Apache 2.0.
  • llama.cpp[3] / Ollama[4]: CPU and Apple Silicon, simpler deployment.
  • TensorRT-LLM[9]: performance ceiling on NVIDIA GPUs, but high operational complexity.
  • LMDeploy[10]: very good performance on certain models.

Conclusion

TGI remains a robust, sensible choice for most teams serving open models: support for popular hub models, easy quantization, familiar API, and HF-ecosystem integration. The 2.0 licence change is the most important thing to evaluate before adopting in commercial contexts. For absolute throughput on top GPUs, vLLM usually wins; for maximum simplicity on CPU, llama.cpp. If you have no licence constraints and are in the HF ecosystem, TGI remains the lowest-friction option.

References
  1. Text Generation Inference (TGI)
  2. vLLM
  3. llama.cpp
  4. Ollama
  5. locust
  6. vegeta
  7. LangChain
  8. LiteLLM
  9. TensorRT-LLM
  10. LMDeploy

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.