TensorRT-LLM: NVIDIA GPU Acceleration for LLMs

A processor chip lit in neon green, representing a high-performance GPU

TensorRT-LLM (NVIDIA) is the performance ceiling for LLM inference on NVIDIA GPUs. It is highly optimised, with custom kernels, advanced quantisation, and sophisticated multi-GPU orchestration. It is more complex than vLLM but delivers 2-3x the throughput in optimal cases. This article covers when the complexity pays off.

What It Is

  • NVIDIA’s optimisation stack for LLMs.
  • Built on TensorRT (GPU inference runtime).
  • Custom kernels: handcrafted for LLM operations.
  • Multi-GPU: tensor and pipeline parallelism.
  • Quantisation: advanced FP8, INT8, and INT4 support.
  • H100 / B100 specific optimisations.

vs vLLM

Aspect         | TensorRT-LLM      | vLLM
---------------|-------------------|--------------
Throughput     | Highest (optimal) | Very high
Setup          | Complex           | Easy
Build time     | Hours             | None
Hardware       | NVIDIA only       | NVIDIA mostly
Quantisation   | Most advanced     | Good
Community      | NVIDIA-centric    | Broad
Documentation  | Extensive         | Accessible

vLLM: easier, broadly compatible. TensorRT-LLM: squeeze every last bit.

Build Process

Unlike vLLM, which runs models directly, TensorRT-LLM requires:

  1. Clone model weights.
  2. Convert to TensorRT-LLM format.
  3. Build an engine for the specific GPU (an H100 engine differs from an A100 engine).
  4. Deploy with Triton Inference Server (typical).
# Clone weights
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct

# Convert
python convert_checkpoint.py --model_dir ... --output_dir ...

# Build engine for H100
trtllm-build --checkpoint_dir ... \
             --output_dir engines \
             --max_batch_size 64 \
             --max_input_len 32768

# Serve
tritonserver --model-repository ./triton_model_repo

Hours to first query.

Performance Gains

NVIDIA-reported benchmarks:

  • H100 Llama 3 70B: 2x throughput vs vLLM baseline.
  • H100 Llama 3 8B: 3-4x throughput.
  • Low latency mode: sub-50ms first token.

Real gains vary with workload specifics.

Triton Integration

TensorRT-LLM typically served via NVIDIA Triton:

  • Model ensemble: pre-process + inference + post-process.
  • Dynamic batching.
  • Multi-model hosting.
  • gRPC + HTTP endpoints.

Production-grade serving.

OpenAI-Compatible?

Triton + TensorRT-LLM is not natively OpenAI-compatible. Options:

  • OpenAI proxy wrapper: implement yourself.
  • LiteLLM: adapter exists.
  • Community: partial wrappers.

Less seamless than vLLM.
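A do-it-yourself proxy mostly boils down to translating request bodies. A minimal sketch of that translation, assuming Triton's generate endpoint with `text_input`/`max_tokens` field names as used in the TensorRT-LLM backend examples (verify against your deployed model's config):

```python
# Hypothetical sketch: map an OpenAI-style /v1/completions request body to a
# Triton /v2/models/<name>/generate request body. Field names are assumptions
# taken from common TensorRT-LLM backend examples; check your model config.

def openai_to_triton(body: dict) -> dict:
    return {
        "text_input": body["prompt"],
        "max_tokens": body.get("max_tokens", 64),
        "temperature": body.get("temperature", 1.0),
        "stream": body.get("stream", False),
    }

payload = openai_to_triton({"prompt": "Hello", "max_tokens": 32})
print(payload)
```

A real proxy would also need to reshape the response and handle streaming, which is where most of the work hides.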

Quantisation

Options:

  • FP16: baseline.
  • FP8: H100 tensor cores, near-lossless.
  • INT8 SmoothQuant: good quality, significant speedup.
  • INT4 AWQ: aggressive, some quality tradeoff.

Quality testing is mandatory: quantisation affects outputs.
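The memory side of the trade-off is simple arithmetic. A rough sketch that ignores activations, KV cache, and per-layer overheads (illustrative only):

```python
# Rough weight-memory arithmetic for the quantisation options above.
# Ignores activations, KV cache, and per-layer overheads.

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

def weight_gb(n_params: float, fmt: str) -> float:
    """Approximate weight memory in GB for a given parameter count and format."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"Llama 3 70B weights in {fmt}: ~{weight_gb(70e9, fmt):.0f} GB")
```

FP16 puts a 70B model at roughly 140 GB of weights alone, which is why INT4 (about 35 GB) is tempting despite the quality risk.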

Hardware-Specific Builds

Engines are hardware-specific:

  • H100 build doesn’t run on A100.
  • A100 build sub-optimal on H100.
  • Multi-GPU: specific topology assumptions.

Reduces portability. Manage carefully.

Custom Features

  • In-flight batching: add requests to ongoing batch.
  • Paged attention: like vLLM.
  • FMHA: fused multi-head attention.
  • Speculative decoding: supported.
  • Medusa heads: drafting acceleration.

Cutting-edge techniques usually appear here first.
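The idea behind in-flight (continuous) batching can be shown with a toy scheduler: new requests join the running batch as soon as a finished request frees a slot, instead of waiting for the whole batch to drain. Token counts here are made up for illustration; the real scheduler lives inside the runtime.

```python
from collections import deque

# Toy illustration of in-flight batching, not TensorRT-LLM's actual scheduler.
def run_inflight(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate).
    Returns request ids in completion order."""
    pending = deque(requests)
    active = {}  # request_id -> remaining tokens to generate
    finished = []
    while pending or active:
        # Fill free slots immediately: this is the "in-flight" part.
        while pending and len(active) < max_batch:
            rid, tokens = pending.popleft()
            active[rid] = tokens
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
    return finished

print(run_inflight([("a", 3), ("b", 1), ("c", 2)], max_batch=2))  # ['b', 'a', 'c']
```

Note that "c" starts as soon as "b" completes, while "a" is still generating; static batching would have made "c" wait for the entire first batch.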

Complexity Cost

Real overhead:

  • Build pipeline: tricky CI/CD.
  • Debugging: NVIDIA tools, not standard.
  • Updates: model updates require rebuild.
  • Team expertise: CUDA knowledge helps.
  • NVIDIA documentation: improved but dense.

Not a tool for casual use.

When TensorRT-LLM Wins

  • Maximum throughput critical.
  • Stable model (no frequent updates).
  • Exclusive NVIDIA hardware.
  • Team with GPU engineering.
  • Cost-sensitive scale: saving 30% of GPU time is big money.

When vLLM Wins

  • Quick deployment.
  • Frequent model changes.
  • Smaller team.
  • Sufficient “good-enough” throughput.
  • Multi-vendor flexibility.

Most teams: vLLM. Few very-high-volume: TensorRT-LLM.

Cost Examples

For 10M tokens/day workload:

  • vLLM A100: $200-400/day compute.
  • TensorRT-LLM optimised: $100-200/day compute.

A 50% saving (roughly $150/day at the mid-points of those ranges) compounds to over $50k/year. For smaller volumes, the engineering time doesn't pay for itself.

Performance Tuning

Knobs:

  • Batch size: find sweet spot (usually 16-64 for 70B).
  • Max input/output length: tight = faster.
  • Tensor parallelism: right for model size.
  • Kernel variants: benchmark.

Expect real, iterative optimisation work.
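For the tensor-parallelism knob, a first guess can come from simple memory arithmetic before benchmarking. A hedged sketch, assuming power-of-two TP degrees (the common case) and an invented 1.3x headroom factor for KV cache and activations that you should tune:

```python
# Hypothetical helper: smallest power-of-two tensor-parallel degree whose
# combined GPU memory fits the (quantised) weights plus headroom.
# The 1.3x headroom for KV cache/activations is an assumption, not a rule.

def min_tp_degree(weight_gb: float, gpu_mem_gb: float,
                  headroom: float = 1.3) -> int:
    need = weight_gb * headroom
    tp = 1
    while tp * gpu_mem_gb < need:
        tp *= 2  # power-of-two TP degrees are typical in practice
    return tp

# Llama 3 70B in FP16 (~140 GB of weights) on 80 GB H100s:
print(min_tp_degree(140, 80))  # 4
```

This only sets the starting point; the final TP degree should come from benchmarking, since more GPUs per replica also means more communication overhead.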

NVIDIA Alternatives

Other NVIDIA inference options:

  • Triton only (without TensorRT-LLM): supports multiple backends.
  • NeMo Inference: enterprise-targeted.
  • NIM (NVIDIA Inference Microservices): turnkey, packaged.

For enterprises, NIM is the easier entry point.

Enterprise Support

  • NVIDIA AI Enterprise: support, patches.
  • Price: significant annual cost.
  • Benefits: compliance, SLAs, direct engineering access.

For regulated industries, worth considering.

Open Alternatives

  • vLLM: broad.
  • TGI (Hugging Face): matured.
  • LMDeploy: competitive on certain models.
  • SGLang: structured generation focus.

All simpler than TensorRT-LLM.

Conclusion

TensorRT-LLM is a tool for teams serious about NVIDIA GPU efficiency in LLM serving. For most deployments, vLLM is the better trade-off. For very high-volume serving where squeezing out 2x performance matters, TensorRT-LLM pays for its complexity. The setup cost is real: build pipelines, team expertise, update overhead. Evaluate it carefully against the simpler alternatives. For enterprises with an NVIDIA AI Enterprise contract, the NIM packaging significantly simplifies adoption.

Follow us on jacar.es for more on LLM inference, NVIDIA GPUs, and optimisation.
