TensorRT-LLM is NVIDIA's performance ceiling for LLM inference on its GPUs. Highly optimised — custom kernels, advanced quantisation, sophisticated multi-GPU orchestration. More complex than vLLM, but 2-3x the throughput in optimal cases. This article covers when the complexity pays off.
What It Is
- NVIDIA’s optimisation stack for LLMs.
- Built on TensorRT (GPU inference runtime).
- Custom kernels: handcrafted for LLM operations.
- Multi-GPU: tensor and pipeline parallelism.
- Advanced quantisation: FP8, INT8, INT4.
- Architecture-specific optimisations for H100 / B100.
vs vLLM
| Aspect | TensorRT-LLM | vLLM |
|---|---|---|
| Throughput | Highest (optimal) | Very high |
| Setup | Complex | Easy |
| Build time | Hours | None |
| Hardware | NVIDIA only | NVIDIA, plus AMD and others |
| Quantisation | Most advanced | Good |
| Community | NVIDIA-centric | Broad |
| Documentation | Extensive | Accessible |
vLLM: easier, broadly compatible. TensorRT-LLM: squeeze every last bit.
Build Process
Unlike vLLM (which runs models directly), TensorRT-LLM requires a build step:
- Download model weights.
- Convert to TensorRT-LLM format.
- Build engine for specific GPU (H100 vs A100 different).
- Deploy with Triton Inference Server (typical).
```shell
# Download weights
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct

# Convert to TensorRT-LLM checkpoint format
python convert_checkpoint.py --model_dir ... --output_dir ...

# Build engine for H100
trtllm-build --checkpoint_dir ... \
    --output_dir engines \
    --max_batch_size 64 \
    --max_input_len 32768

# Serve via Triton
tritonserver --model-repository ./triton_model_repo
```
Hours to first query.
Performance Gains
NVIDIA-reported benchmarks:
- H100 Llama 3 70B: 2x throughput vs vLLM baseline.
- H100 Llama 3 8B: 3-4x throughput.
- Low latency mode: sub-50ms first token.
Real gains vary with workload specifics.
Triton Integration
TensorRT-LLM typically served via NVIDIA Triton:
- Model ensemble: pre-process + inference + post-process.
- Dynamic batching.
- Multi-model hosting.
- gRPC + HTTP endpoints.
Production-grade serving.
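A minimal client sketch against Triton's HTTP generate endpoint. The field names (`text_input`, `max_tokens`, `text_output`) and the `ensemble` model name are the defaults in the Triton TensorRT-LLM backend's shipped configs; verify them against your own model repository, as they are configurable:

```python
import json

# Field names follow the default ensemble config of the Triton
# TensorRT-LLM backend; check your model's config.pbtxt.
def build_generate_payload(prompt: str, max_tokens: int = 128,
                           temperature: float = 0.7) -> str:
    """JSON body for Triton's /v2/models/<name>/generate endpoint."""
    return json.dumps({
        "text_input": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    })

payload = build_generate_payload("Explain paged attention in one sentence.")

# Live call (requires a running server; "ensemble" is the conventional
# model name for the TensorRT-LLM pipeline):
#   import requests
#   r = requests.post("http://localhost:8000/v2/models/ensemble/generate",
#                     data=payload)
#   print(r.json()["text_output"])
```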
OpenAI-Compatible?
Triton + TensorRT-LLM is not natively OpenAI-compatible. Solutions:
- OpenAI proxy wrapper: implement yourself.
- LiteLLM: adapter exists.
- Community: partial wrappers.
Less seamless than vLLM.
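The "implement yourself" option mostly means translating request/response envelopes. A hedged sketch of the two translation functions such a proxy needs — the naive role-prefix prompt template and the Triton field names are assumptions; in practice use your model's real chat template:

```python
# Hypothetical translation layer between OpenAI-style chat requests
# and the Triton TensorRT-LLM ensemble's generate payload.

def chat_to_triton(openai_request: dict) -> dict:
    """Flatten chat messages into one prompt string (naive template)."""
    prompt = ""
    for msg in openai_request["messages"]:
        prompt += f"{msg['role']}: {msg['content']}\n"
    prompt += "assistant: "
    return {
        "text_input": prompt,
        "max_tokens": openai_request.get("max_tokens", 256),
        "temperature": openai_request.get("temperature", 1.0),
    }

def triton_to_chat(triton_response: dict, model: str) -> dict:
    """Wrap Triton's text_output in an OpenAI-style response envelope."""
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant",
                        "content": triton_response["text_output"]},
            "finish_reason": "stop",
        }],
    }
```

Mounting these two functions behind a small HTTP server gives you the proxy; streaming and token accounting are the remaining (harder) parts.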
Quantisation
Options:
- FP16: baseline.
- FP8: H100 tensor cores, near-lossless.
- INT8 SmoothQuant: good quality, significant speedup.
- INT4 AWQ: aggressive, some quality tradeoff.
Mandatory quality testing — quantisation affects outputs.
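A minimal sketch of such a quality gate, assuming you can collect outputs from a reference FP16 engine and the quantised candidate on the same prompt set. The difflib similarity and the 0.9 threshold are illustrative stand-ins for your real task metrics:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string-level similarity; replace with task-specific evals."""
    return SequenceMatcher(None, a, b).ratio()

def passes_quality_gate(reference: list[str], quantised: list[str],
                        threshold: float = 0.9) -> bool:
    """Mean similarity of quantised outputs vs FP16 reference outputs."""
    scores = [similarity(r, q) for r, q in zip(reference, quantised)]
    return sum(scores) / len(scores) >= threshold

ref = ["Paris is the capital of France.", "2 + 2 = 4"]
print(passes_quality_gate(ref, ref))  # identical outputs pass trivially
```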
Hardware-Specific Builds
Engines are hardware-specific:
- H100 build doesn’t run on A100.
- A100 build sub-optimal on H100.
- Multi-GPU: specific topology assumptions.
Reduces portability. Manage carefully.
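One pragmatic way to manage this is to key engine artifacts by model, SM architecture, and tensor-parallel degree, and refuse to serve a mismatch. A minimal sketch — the names, paths, and registry shape are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EngineKey:
    model: str    # e.g. "llama3-70b"
    sm_arch: str  # e.g. "sm90" (H100), "sm80" (A100)
    tp_size: int  # tensor-parallel degree baked into the build

# Hypothetical artifact store layout.
REGISTRY = {
    EngineKey("llama3-70b", "sm90", 4): "s3://engines/llama3-70b/h100-tp4/",
    EngineKey("llama3-70b", "sm80", 8): "s3://engines/llama3-70b/a100-tp8/",
}

def resolve_engine(model: str, sm_arch: str, tp_size: int) -> str:
    """Fail loudly instead of loading an engine built for other hardware."""
    key = EngineKey(model, sm_arch, tp_size)
    if key not in REGISTRY:
        raise LookupError(f"No engine built for {key}; rebuild required.")
    return REGISTRY[key]
```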
Custom Features
- In-flight batching: add requests to ongoing batch.
- Paged attention: like vLLM.
- FMHA: fused multi-head attention.
- Speculative decoding: supported.
- Medusa heads: drafting acceleration.
Cutting-edge techniques usually appear here first.
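Of these, speculative decoding has a simple expected-gain model: with draft length k and per-token acceptance rate a, the expected tokens produced per target-model step follows the standard speculative-sampling analysis. A quick sketch — a and k here are hypothetical inputs for intuition, not TensorRT-LLM parameters:

```python
def expected_tokens_per_step(a: float, k: int) -> float:
    """Expected accepted tokens per target step, standard analysis:
    (1 - a^(k+1)) / (1 - a), assuming i.i.d. per-token acceptance."""
    return (1 - a ** (k + 1)) / (1 - a)

# A decent draft model (a = 0.8) with 4 draft tokens yields ~3.4
# tokens per expensive target-model forward pass.
print(expected_tokens_per_step(0.8, 4))
```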
Complexity Cost
Real overhead:
- Build pipeline: tricky CI/CD.
- Debugging: NVIDIA tools, not standard.
- Updates: model updates require rebuild.
- Team expertise: CUDA knowledge helps.
- NVIDIA documentation: improved but dense.
Not a tool for casual use.
When TensorRT-LLM Wins
- Maximum throughput critical.
- Stable model (no frequent updates).
- Exclusive NVIDIA hardware.
- Team with GPU engineering.
- Cost-sensitive scale: saving 30% GPU time = big $.
When vLLM Wins
- Quick deployment.
- Frequent model changes.
- Smaller team.
- Sufficient “good-enough” throughput.
- Multi-vendor flexibility.
Most teams: vLLM. Few very-high-volume: TensorRT-LLM.
Cost Examples
For 10M tokens/day workload:
- vLLM A100: $200-400/day compute.
- TensorRT-LLM optimised: $100-200/day compute.
Saving ~50% per day, every day of the year, is real money. For smaller volumes, the engineering time doesn't pay for itself.
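Using the midpoints of the figures above, a back-of-envelope payback calculation (all inputs illustrative, including the hypothetical $30k migration cost):

```python
def annual_saving(vllm_daily: float, trt_daily: float) -> float:
    """Yearly compute saving from switching, in dollars."""
    return (vllm_daily - trt_daily) * 365

def payback_days(engineering_cost: float, vllm_daily: float,
                 trt_daily: float) -> float:
    """Days until the migration's engineering cost is recouped."""
    return engineering_cost / (vllm_daily - trt_daily)

print(annual_saving(300, 150))         # midpoints: $54,750/year saved
print(payback_days(30_000, 300, 150))  # hypothetical $30k migration: 200 days
```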
Performance Tuning
Knobs:
- Batch size: find sweet spot (usually 16-64 for 70B).
- Max input/output length: tight = faster.
- Tensor parallelism: right for model size.
- Kernel variants: benchmark.
Real iterative optimisation.
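The iteration loop above can be sketched as a batch-size sweep harness. `run_batch` here is a simulated stand-in (a sleep) so the harness runs as-is; in practice it would call your TensorRT-LLM engine and you would sweep the other knobs the same way:

```python
import time

def run_batch(batch_size: int) -> None:
    """Placeholder for a real inference call into the engine."""
    time.sleep(0.001 * batch_size)

def tokens_per_second(batch_size: int, tokens_per_request: int = 128) -> float:
    """Measure throughput for one batch size."""
    start = time.perf_counter()
    run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * tokens_per_request / elapsed

# Sweep the candidate batch sizes and keep the best.
best = max([8, 16, 32, 64], key=tokens_per_second)
print(f"best batch size in sweep: {best}")
```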
NVIDIA Alternatives
Other NVIDIA inference options:
- Triton only (without TensorRT-LLM): supports multiple backends.
- NeMo Inference: enterprise-targeted.
- NIM (NVIDIA Inference Microservices): turnkey, packaged.
For enterprises, NIM is the easier entry point.
Enterprise Support
- NVIDIA AI Enterprise: support, patches.
- Price: significant annual cost.
- Benefits: compliance, SLAs, direct engineering access.
For regulated industries, worth considering.
Open Alternatives
- vLLM: broad.
- TGI (Hugging Face): matured.
- LMDeploy (InternLM team): competitive on certain models.
- SGLang: structured generation focus.
All simpler than TensorRT-LLM.
Conclusion
TensorRT-LLM is a tool for teams serious about NVIDIA GPU efficiency in LLM serving. For most deployments, vLLM is the better trade-off. At very high volume, where squeezing out 2x performance matters, TensorRT-LLM pays back its complexity. The setup cost is real: build pipelines, team expertise, rebuild churn on every model update. Evaluate carefully against simpler alternatives. For enterprises with an NVIDIA AI Enterprise contract, the NIM packaging significantly simplifies adoption.
Follow us on jacar.es for more on LLM inference, NVIDIA GPU, and optimisation.