TensorRT-LLM is the performance ceiling on NVIDIA GPUs: more complex to operate, but up to 2-3x faster than vLLM in favorable cases.
vLLM: Serving LLMs in Production with Very High Throughput
vLLM has become the de facto standard for serving LLMs on GPU: PagedAttention, continuous batching, and an OpenAI-compatible API. How to deploy it well, and when it is worth it.
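As a quick taste of the workflow the post covers, here is a minimal sketch of querying a locally deployed vLLM server through its OpenAI-compatible API (model name, port, and prompt are illustrative; assumes the server was already started with `vllm serve`):

```python
# Minimal sketch: querying a local vLLM OpenAI-compatible endpoint.
# Assumes the server is running, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# vLLM listens on port 8000 by default; no real key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    messages=[
        {"role": "user", "content": "Summarize PagedAttention in one sentence."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the API surface matches OpenAI's, existing client code can usually be pointed at a vLLM deployment by changing only the base URL.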