TensorRT-LLM: NVIDIA GPU acceleration for LLMs

Processor chip lit in neon green, representing a high-performance GPU

TensorRT-LLM (NVIDIA) is the performance ceiling for LLM inference on NVIDIA GPUs. Highly optimized: custom kernels, advanced quantization, sophisticated multi-GPU orchestration. More complex than vLLM, but 2-3x the throughput in optimal cases. This article covers when the complexity is worth it.

What it is

  • NVIDIA's optimization stack for LLMs.
  • Built on TensorRT (NVIDIA's GPU inference runtime).
  • Custom kernels handcrafted for LLM operations.
  • Multi-GPU: tensor and pipeline parallelism.
  • Advanced quantization: FP8, INT8, INT4.
  • Architecture-specific optimizations for H100/B100.

vs vLLM

Aspect          TensorRT-LLM        vLLM
Throughput      Highest (optimal)   Very high
Setup           Complex             Easy
Build time      Hours               None
Hardware        NVIDIA only         NVIDIA mostly
Quantization    Most advanced       Good
Community       NVIDIA-centric      Broad
Documentation   Extensive           Accessible

vLLM: easier, broadly compatible. TensorRT-LLM: squeeze out every last bit of performance.

Build process

Unlike vLLM, which runs models directly, TensorRT-LLM requires a build pipeline:

  1. Download model weights.
  2. Convert them to TensorRT-LLM checkpoint format.
  3. Build an engine for the specific GPU (H100 and A100 engines differ).
  4. Deploy with Triton Inference Server (the typical path).

# Clone weights
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct

# Convert
python convert_checkpoint.py --model_dir ... --output_dir ...

# Build engine for H100
trtllm-build --checkpoint_dir ... \
             --output_dir engines \
             --max_batch_size 64 \
             --max_input_len 32768

# Serve
tritonserver --model-repository ./triton_model_repo

Hours to first query.

Performance gains

NVIDIA-reported benchmarks:

  • H100 Llama 3 70B: 2x throughput vs vLLM baseline.
  • H100 Llama 3 8B: 3-4x throughput.
  • Low latency mode: sub-50ms first token.

Real-world gains vary with workload specifics.

Triton Integration

TensorRT-LLM is typically served via NVIDIA Triton:

  • Model ensemble: pre-process + inference + post-process.
  • Dynamic batching.
  • Multi-model hosting.
  • gRPC + HTTP endpoints.

Production-grade serving.
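To make the serving path concrete, here is a minimal sketch of building a request for Triton's HTTP generate endpoint as exposed by the tensorrtllm_backend "ensemble" model. The URL, port, and field names (`text_input`, `max_tokens`, `text_output`) follow the backend's published examples but may differ across versions, so treat them as assumptions to check against your deployment.

```python
import json

# Assumed Triton HTTP endpoint for the tensorrtllm_backend ensemble model;
# port 8000 is Triton's default HTTP port.
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

def build_generate_request(prompt: str, max_tokens: int = 64) -> tuple[str, bytes]:
    """Return (url, body) for a POST to the Triton generate endpoint."""
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return TRITON_URL, json.dumps(payload).encode("utf-8")

url, body = build_generate_request("What is TensorRT?", max_tokens=32)
# POST `body` to `url` with Content-Type: application/json;
# the response JSON carries the completion under "text_output".
```

In production you would typically use the gRPC endpoint instead for lower overhead, but the HTTP route is the quickest way to smoke-test a fresh engine.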

OpenAI-compatible?

Triton + TensorRT-LLM has no native OpenAI-compatible API. Options:

  • OpenAI proxy wrapper: implement it yourself.
  • LiteLLM: an adapter exists.
  • Community: partial wrappers.

Less seamless than vLLM.
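The "implement it yourself" route boils down to translating request formats. A minimal sketch of that translation layer, assuming the Triton field names from the tensorrtllm_backend examples (a real proxy would also apply the model's chat template rather than naively joining messages):

```python
# Hypothetical shim: OpenAI-style chat request -> Triton generate payload.
def openai_to_triton(chat_request: dict) -> dict:
    # Naive flattening of chat messages into one prompt; production proxies
    # apply the model's actual chat template here instead.
    prompt = "\n".join(
        f"{m['role']}: {m['content']}" for m in chat_request["messages"]
    )
    return {
        "text_input": prompt,
        "max_tokens": chat_request.get("max_tokens", 128),
        "temperature": chat_request.get("temperature", 1.0),
    }

req = {"model": "llama-3-70b", "messages": [{"role": "user", "content": "Hi"}]}
print(openai_to_triton(req)["text_input"])  # -> user: Hi
```

The reverse mapping (Triton's `text_output` back into an OpenAI `chat.completion` object) is the other half of the proxy, plus streaming support if clients need it.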

Quantization

Options:

  • FP16: baseline.
  • FP8: H100 tensor cores, near-lossless.
  • INT8 SmoothQuant: good quality, significant speedup.
  • INT4 AWQ: aggressive, some quality tradeoff.

Quality testing is mandatory: quantization affects outputs.
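The memory side of the trade-off is simple arithmetic: weight footprint scales linearly with bit width. A back-of-the-envelope calculation for a 70B-parameter model (weights only; KV cache and activations add more on top):

```python
# Weight memory per precision, ignoring KV cache and activation overhead.
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(70e9, bits):.0f} GB")
# FP16: 140 GB, FP8: 70 GB, INT4: 35 GB
```

This is why INT4 makes a 70B model fit on a single 80 GB H100 while FP16 forces multi-GPU tensor parallelism before throughput even enters the picture.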

Hardware-specific builds

Engines are hardware-specific:

  • An H100 build won't run on an A100.
  • An A100 build is sub-optimal on an H100.
  • Multi-GPU engines bake in specific topology assumptions.

Reduces portability; manage engine artifacts carefully.
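One way to manage that carefully is to key engine artifacts by GPU model and tensor-parallel degree, and fail loudly on a mismatch rather than let an H100 engine reach an A100 node. A sketch with illustrative (made-up) artifact paths:

```python
# Hypothetical engine registry: (gpu_model, tp_size) -> artifact location.
ENGINES = {
    ("H100", 4): "s3://models/llama3-70b/h100-tp4/",
    ("A100", 8): "s3://models/llama3-70b/a100-tp8/",
}

def engine_for(gpu: str, tp: int) -> str:
    """Resolve the engine for this hardware, or fail fast if none was built."""
    try:
        return ENGINES[(gpu, tp)]
    except KeyError:
        raise RuntimeError(f"No engine built for {gpu} tp={tp}; rebuild required")

print(engine_for("H100", 4))  # -> s3://models/llama3-70b/h100-tp4/
```

Wiring this check into the deploy step turns a silent performance (or crash) problem into an explicit build-pipeline error.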

Custom features

  • In-flight batching: add requests to an ongoing batch.
  • Paged attention: like vLLM.
  • FMHA: fused multi-head attention.
  • Speculative decoding: supported.
  • Medusa heads: drafting acceleration.

Cutting-edge techniques usually appear here first.
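The idea behind speculative decoding, in particular, fits in a few lines. A toy illustration of the accept/verify loop: a cheap draft model proposes several tokens, the target model checks them, and decoding keeps the longest verified prefix. Real implementations compare token probabilities; here both "models" are stand-in token lists.

```python
# Toy accept/verify loop for speculative decoding (greedy variant).
def speculative_step(draft_tokens, target_tokens):
    """Keep draft tokens while they match the target; on the first mismatch,
    take the target's token instead and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)  # target's correction replaces the mismatch
            break
        accepted.append(d)
    return accepted

print(speculative_step(["the", "cat", "sat"], ["the", "cat", "ran"]))
# -> ['the', 'cat', 'ran']  (two draft tokens accepted, mismatch corrected)
```

The win: one target-model forward pass can validate several tokens at once, so accepted drafts amortize its cost. Medusa heads play the drafting role without a separate model.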

Complexity cost

Real overhead:

  • Build pipeline: tricky to integrate into CI/CD.
  • Debugging: NVIDIA-specific tools, not standard stacks.
  • Updates: every model update requires an engine rebuild.
  • Team expertise: CUDA knowledge helps.
  • NVIDIA documentation: improved, but dense.

Not a tool for casual use.

When TensorRT-LLM wins

  • Maximum throughput critical.
  • Stable model (no frequent updates).
  • NVIDIA hardware exclusive.
  • Team with GPU engineering expertise.
  • Cost-sensitive scale: saving 30% of GPU time is serious money.

When vLLM wins

  • Quick deployment.
  • Frequent model changes.
  • Smaller team.
  • Good-enough throughput sufficient.
  • Multi-vendor flexibility.

Most teams: vLLM. A few very high-volume shops: TensorRT-LLM.

Cost examples

For a 10M tokens/day workload:

  • vLLM on A100s: $200-400/day compute.
  • TensorRT-LLM optimized: $100-200/day compute.

A ~50% saving × 365 days adds up. At smaller volumes, the engineering time doesn't pay for itself.
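The arithmetic behind that claim, using the midpoints of the ranges above (the daily figures are the article's illustrative numbers, not a benchmark):

```python
# Annual savings from halving compute cost, midpoint estimates.
vllm_daily = (200 + 400) / 2   # $300/day on vLLM
trt_daily = (100 + 200) / 2    # $150/day on optimized TensorRT-LLM
annual_savings = (vllm_daily - trt_daily) * 365
print(f"${annual_savings:,.0f}/year")  # -> $54,750/year
```

At that scale a few engineer-months of build-pipeline work pays back within the year; at a tenth of the volume it clearly does not.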

Performance tuning

Knobs:

  • Batch size: find the sweet spot (usually 16-64 for 70B models).
  • Max input/output lengths: tighter limits = faster.
  • Tensor parallelism: size it to the model.
  • Kernel variants: benchmark them.

Optimization here is genuinely iterative.
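The batch-size knob in particular is best turned empirically. A sketch of the sweep: benchmark each candidate and keep the configuration with the best observed tokens/s. The numbers below are placeholders; in practice each measurement comes from a real load test against the engine.

```python
# Pick the batch size with the highest measured throughput.
def best_batch_size(measurements: dict[int, float]) -> int:
    """measurements: batch_size -> observed tokens/s. Return the sweet spot."""
    return max(measurements, key=measurements.get)

observed = {8: 1900.0, 16: 3100.0, 32: 4200.0, 64: 4100.0}  # hypothetical
print(best_batch_size(observed))  # -> 32 (throughput flattens past it)
```

Note that maximum throughput and acceptable latency can disagree; a latency-sensitive service would add a per-request latency ceiling to the same sweep.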

NVIDIA alternatives

Other NVIDIA inference options:

  • Triton only (without TensorRT-LLM): supports multiple backends.
  • NeMo Inference: enterprise-targeted.
  • NIM (NVIDIA Inference Microservices): turnkey, packaged.

NIM is the easier entry point for enterprises.

Enterprise support

  • NVIDIA AI Enterprise: support, patches.
  • Price: significant annual cost.
  • Benefits: compliance, SLAs, direct engineering access.

For regulated industries, worth considering.

Open alternatives

  • vLLM: broad adoption.
  • TGI (Hugging Face): mature.
  • LMDeploy: competitive on certain models.
  • SGLang: structured-generation focus.

All simpler than TensorRT-LLM.

Conclusion

TensorRT-LLM is the tool for teams serious about NVIDIA GPU efficiency in LLM serving. For most deployments, vLLM is the better trade-off. For very high-volume workloads where squeezing out 2x performance matters, TensorRT-LLM pays for its complexity. The setup cost is real: build pipelines, team expertise, update complexity. Evaluate it carefully against simpler alternatives. For enterprises with an NVIDIA AI Enterprise contract, the NIM packaging simplifies adoption significantly.

Follow us at jacar.es for more on LLM inference, NVIDIA GPUs, and optimization.
