TensorRT-LLM: Extreme Acceleration on NVIDIA GPUs for LLMs
Updated: 2026-05-03
TensorRT-LLM[1] is NVIDIA’s bet on extracting the last few percent of performance from its GPUs for language model inference. Where vLLM optimises memory usage and request scheduling, TensorRT-LLM goes one level lower: hand-written CUDA kernels for each critical operation, advanced quantization with native FP8 and INT4, and sophisticated multi-GPU orchestration that squeezes the NVLink interconnect between H100s. The price is much higher build and deployment complexity. The relevant question is not whether it is faster (it generally is, 2x-3x in optimal cases) but when that difference justifies the operational cost.
Key takeaways
- TensorRT-LLM compiles the model into an optimised engine specific to the target GPU and batch size; it does not serve raw weights.
- Custom CUDA kernels for attention, feed-forward, and norm operations are the primary source of advantage over vLLM.
- Native FP8 on H100 is the biggest gain: 2x the theoretical BF16 throughput with minimal quality loss.
- The build process can take 30-90 minutes and the resulting engine is tied to the exact hardware it was compiled on.
- The advantage shrinks with smaller models or non-H100 hardware; on A100 with BF16, the difference from vLLM is smaller.
What makes TensorRT-LLM different
vLLM and TGI optimise at the software layer: how to manage memory, how to schedule requests, how to minimise idle GPU time. TensorRT-LLM operates at a lower layer: how each mathematical operation executes on the silicon.
Concrete differences:
- Attention kernels: NVIDIA has written FlashAttention-2 and FlashMHA implementations optimised for each GPU architecture. Not PyTorch’s generic kernels — CUDA code specific to each operation and tensor size.
- Native FP8 quantization: H100s have specific hardware units for FP8 arithmetic. TensorRT-LLM uses them directly; other frameworks do software emulation.
- Kernel fusion: operations that normally generate two or three separate kernel launches (for example norm + multiply + activation) are fused into a single kernel, eliminating synchronisation overhead and reducing memory movements (see the sketch after this list).
- Execution plan calibration: unlike vLLM, which dynamically decides how to serve each request, TensorRT-LLM compiles the model against a specific usage profile (maximum batch size, maximum sequence length), enabling optimisations impossible at dynamic runtime.
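To make the kernel-fusion point concrete, here is a toy PyTorch sketch of the unfused sequence. This is not TensorRT-LLM code, and the specific pattern (RMSNorm + gate + SiLU) is only an illustrative example of the kind of chain a fused kernel collapses.

```python
import torch
import torch.nn.functional as F

def rmsnorm_mul_silu_unfused(x: torch.Tensor, norm_weight: torch.Tensor,
                             gate: torch.Tensor) -> torch.Tensor:
    # Eager PyTorch issues several separate kernels here, each writing its
    # intermediate result to GPU memory before the next one reads it back.
    normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6) * norm_weight
    gated = normed * gate
    return F.silu(gated)

# A fused kernel computes the same expression in a single launch, keeping the
# intermediates in registers or shared memory instead of round-tripping to HBM.
x = torch.randn(4, 4096)
out = rmsnorm_mul_silu_unfused(x, torch.ones(4096), torch.randn(4, 4096))
```

The saving per fusion is small, but at thousands of forward passes per second the removed launches and memory round-trips add up.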
The build process
TensorRT-LLM’s build has three stages that differentiate its usage from vLLM’s:
1. Weight conversion: model weights in HuggingFace format are converted to TensorRT-LLM’s internal format, applying quantization if needed.
```bash
python convert_checkpoint.py \
    --model_dir /models/llama-3-8b \
    --output_dir /checkpoints/llama-3-8b-fp8 \
    --dtype float16 \
    --use_fp8 \
    --fp8_kv_cache
```
2. Engine compilation: TensorRT compiles the operation graph against the specific hardware. This step can take 30-90 minutes for large models.
```bash
trtllm-build \
    --checkpoint_dir /checkpoints/llama-3-8b-fp8 \
    --output_dir /engines/llama-3-8b-h100 \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --max_batch_size 64 \
    --max_input_len 4096 \
    --max_output_len 2048 \
    --tp_size 2
```
3. Serving: the compiled engine is served with Triton Inference Server or TensorRT-LLM’s native server.
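Before wiring up Triton, a quick way to smoke-test a compiled engine is TensorRT-LLM’s Python runtime. The sketch below assumes the ModelRunner API used by the project’s example scripts; argument names have shifted between releases, so treat it as illustrative rather than exact.

```python
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Note: an engine built with --tp_size 2 must be launched across two ranks
# (e.g. mpirun -n 2 python smoke_test.py); a single-GPU engine runs as-is.
tokenizer = AutoTokenizer.from_pretrained("/models/llama-3-8b")
runner = ModelRunner.from_dir(engine_dir="/engines/llama-3-8b-h100")

input_ids = torch.tensor(tokenizer.encode("The capital of France is"),
                         dtype=torch.int32)
outputs = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=32,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
# Output shape is [batch, beams, tokens] and includes the prompt tokens.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```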
The resulting engine is tied to the exact hardware it was compiled on: an engine compiled for H100 does not run on A100, and one compiled for batch size 64 does not perform well for batch size 128. This means the build process must be repeated for each hardware and load configuration.
Real performance: when it pays
Performance differences depend strongly on hardware and model. Scenarios where TensorRT-LLM wins most:
- H100 with FP8: the optimal case. H100’s FP8 hardware units double the theoretical BF16 throughput (see the rough arithmetic after this list). With TensorRT-LLM compiled for FP8, results can be 2x-3x superior to vLLM in BF16.
- Large models on multi-GPU: for 70B+ models on 4-8 H100 configurations with NVLink, optimising inter-GPU traffic is significant.
- Known fixed batch sizes: if your load has predictable length and batch size distributions, compiling the engine against those specific profiles extracts more performance.
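To make the FP8 claim concrete, a back-of-envelope comparison using NVIDIA’s published dense (non-sparsity) peak figures for the H100 SXM; treat the exact numbers as approximations that vary by SKU and clocks.

```python
# Approximate dense Tensor Core peaks for H100 SXM (datasheet figures, no sparsity).
bf16_tflops = 989.0    # BF16 Tensor Core peak
fp8_tflops = 1979.0    # FP8 Tensor Core peak
print(f"compute ratio: {fp8_tflops / bf16_tflops:.1f}x")        # ~2.0x

# FP8 also halves the bytes read per weight, which matters during decode,
# where throughput is usually bound by memory bandwidth rather than FLOPS.
bytes_per_param_bf16, bytes_per_param_fp8 = 2, 1
print(f"weight traffic ratio: {bytes_per_param_bf16 / bytes_per_param_fp8:.1f}x")
```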
Scenarios where the advantage shrinks:
- A100 with BF16: without native FP8 hardware, the difference from vLLM falls to 1.2x-1.5x — often not justifying the operational cost.
- Small models (≤7B): TensorRT-LLM’s custom kernels give less advantage when models fit easily in memory.
- Variable length distribution: if requests vary widely in length, the engine compiled for a specific profile does not squeeze the hardware as much.
TensorRT-LLM vs vLLM: the practical decision
The decision between TensorRT-LLM and vLLM is fundamentally about where the bottleneck is in your system and how much you can invest in operations:
| Factor | vLLM | TensorRT-LLM |
|---|---|---|
| Performance on H100 FP8 | Baseline | 2x-3x faster |
| Performance on A100 BF16 | Baseline | 1.2x-1.5x faster |
| Initial deployment time | < 1h | 2-4h |
| Configuration flexibility | High | Low |
| Model updates | Immediate | Requires recompilation |
| Model support | Broad | Selective |
| Operational needs | Low | High |
The practical recommendation: start with vLLM. If vLLM’s performance on your specific hardware is insufficient — because you have measured that the cost per token is too high or latency does not meet the SLA — evaluate TensorRT-LLM for your specific model and hardware. Do not assume the difference will be 2x-3x; measure with your actual traffic and specific model.
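A minimal sketch of the kind of measurement meant here, assuming an OpenAI-compatible /v1/completions endpoint (vLLM exposes one out of the box; for TensorRT-LLM it depends on the serving frontend you deploy). The URL and model name are placeholders.

```python
import time
import requests

URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
payload = {
    "model": "llama-3-8b",                     # placeholder model name
    "prompt": "Summarise the benefits of FP8 inference in two sentences.",
    "max_tokens": 256,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=120)
elapsed = time.perf_counter() - start

usage = resp.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
if completion_tokens:
    print(f"{completion_tokens} tokens in {elapsed:.2f}s "
          f"-> {completion_tokens / elapsed:.1f} tok/s")
```

A single request only measures per-request latency; for a fair comparison, replay a representative slice of your production traffic at realistic concurrency and compare cost per token, not just peak tokens per second.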
If you fine-tune models with LoRA or QLoRA before serving them, they fit this pipeline well: TensorRT-LLM supports fusing LoRA adapters into the compiled engine.
Conclusion
TensorRT-LLM is the right option when NVIDIA GPU performance is the limiting factor and the team has the operational capacity to manage the compilation cycle. For most teams starting to serve their own LLMs, vLLM offers 70-80% of the performance with 10% of the complexity. The TensorRT-LLM use case is the team that already has vLLM in production, has measured that cost per token does not meet the financial target, and has H100 hardware where FP8 can make the difference.