Jacar mascot — reading along A laptop whose eyes follow your cursor while you read.
Tecnología

NVIDIA Blackwell GPUs: what changes for training

NVIDIA Blackwell GPUs: what changes for training

Actualizado: 2026-05-03

NVIDIA’s Blackwell architecture was unveiled at GTC 2024, and the first commercial systems started reaching select hyperscalers in late 2024. By September 2025 there are enough real deployments, enough public measurements, and enough operational experience to evaluate what actually changes for large-model training. This review focuses on GB200 NVL72, where Blackwell’s logic is expressed fully.

Key takeaways

  • NVIDIA has shifted from designing GPUs to designing racks: the GB200 NVL72 integrates 72 Blackwell GPUs and 36 Grace CPUs in ~120 kW, presenting to software as a single machine with 13.5 TB of HBM3e.
  • Early MLPerf Training v5.0 results show 2.2×–2.8× faster Llama 3 70B training than H100 — not the synthetic 4× NVIDIA claims, but still significant.
  • Native FP4 is the major inference enabler: a model in FP4 occupies ¼ the space and uses ¼ the memory bandwidth — but not all models quantize cleanly to 4 bits.
  • A GB200 NVL72 rack costs ~$3 M with a 9-month lead time; practical access in 2025 is via cloud (CoreWeave, Azure, AWS, GCP) at $6–12/GPU-hour.
  • For most companies training their own models, Blackwell only pays off if training time is the dominant bottleneck.

The core idea: the rack as unit

The most distinctive thing about Blackwell is not the GPU itself, even though the leap over Hopper is substantial. It is that NVIDIA has stopped designing GPUs and started designing racks. GB200 NVL72 integrates 72 Blackwell GPUs and 36 Grace CPUs in a single roughly 120-kilowatt cabinet, interconnected by fifth-generation NVLink at 1.8 terabytes per second. To the software, the rack behaves as a single machine with 13.5 terabytes of unified HBM3e memory.

This is a significant conceptual shift. With Hopper and earlier generations, model parallelism required partitioning the model across GPUs and explicitly managing inter-GPU communication through NCCL. With GB200 NVL72, 72 GPUs can access each other’s memory as if local, albeit with higher latency. This simplifies training patterns where communication between partitions is intense, such as tensor parallelism or mixture-of-experts models.

What the industry is measuring for real performance

NVIDIA’s synthetic numbers point to 4x training throughput over H100 at the same precision, and up to 30x inference speedup with mixed FP4. In practice, early customers report more moderate but still significant gains. Meta and Microsoft published MLPerf Training v5.0 results showing GB200 NVL72 training Llama 3 70B between 2.2x and 2.8x faster than equivalent H100 per GPU at the same power budget.

The gap between theoretical 4x and real 2.5x is interesting. Part comes from synthetic benchmarks assuming FP8 or FP4 precision, while real training still uses a lot of BF16 for stability. Part comes from NVIDIA’s Blackwell software stack still maturing.

The other relevant number is power. A Blackwell B200 draws 1,000 watts versus 700 for H100. A full GB200 NVL72 rack sits at 120 kilowatts, about 10x a traditional general-purpose server rack. This forces direct liquid cooling at the chip, something that was niche in 2023 and has become the default for AI data centers by 2025.

NVIDIA GB200 NVL72 rack at COMPUTEX 2024 — the system that shifts the purchase unit from individual GPU to full cabinet with liquid cooling and 120 kW density, redefining AI data center requirements

FP4 and quantization

One of Blackwell’s technical novelties is native FP4 support, a 4-bit-per-element precision absent in Hopper. FP4 is not useful for training, where gradients need more precision to converge, but it is very useful for inference on large models. A model trained in BF16 and quantized to FP4 occupies a quarter of the space, uses a quarter of the memory bandwidth, and runs much faster if hardware supports it natively.

The catch is that not all models quantize well at 4 bits. For large dense transformers like Llama or Mistral, FP4 with careful calibration loses between 0.5 and 2 points on standard benchmarks. The current recommendation is FP8 for high-quality inference and FP4 where cost matters more than the last points of quality.

This has a practical implication: Blackwell is more attractive to operators with massive inference workloads than to operators who only train. Hyperscalers have both workloads and can amortize the hardware across them.

Software: still evolving

The software ecosystem has two levels. At the low level, CUDA 12.5 and cuDNN 9 already include stable support, as does NCCL 2.22. At the high level, popular frameworks lag behind. PyTorch 2.5 added initial Blackwell support, but real optimizations landed in 2.6 in January 2025. JAX followed a similar path.

By September 2025 the ecosystem is mature for standard model training, but edge cases remain. Serious teams keep a parallel Hopper environment for cross-checks.

Price, availability, and alternatives

In September 2025, a GB200 NVL72 rack costs around $3 million for a direct buyer, with 9-month lead times except for preferred large customers. This means that during 2025 and much of 2026, Blackwell access is essentially via cloud: CoreWeave, Lambda, Crusoe, Azure, AWS, and Google Cloud offer instances at $6 to $12 per GPU-hour depending on region and commitment.

Real alternatives are few: AMD Instinct MI300X has gained share but ROCm still trails CUDA; Google TPU v5p remains competitive but cloud-only; Intel Gaudi 3 has fallen to a distant third.

When it pays off

For most companies training their own models, Blackwell only makes sense if training time is the dominant bottleneck. For inference, it depends on volume. At low volume, Hopper remains cheaper per query because Blackwell is overprovisioned. At high volume, especially with FP4 quantization, Blackwell can cut cost per query by a factor of 3 to 5.

My take

Blackwell marks a phase change in AI infrastructure worth understanding even if you never touch one of these racks directly. When NVIDIA designs the product as a rack rather than a GPU, it is nudging the industry toward a model where the purchase and operations unit is the full cabinet, with liquid cooling and 120 kilowatts of density.

For application builders, the relevant part is that the largest models will be available faster and cheaper via API. Frontier training cost keeps growing, but unit query cost keeps falling. That favors strategies consuming third-party models through APIs over strategies training bespoke models. Only cases where the model itself is the competitive edge, or where data cannot leave the organization, justify training.

For operators of traditional infrastructure, Blackwell is an invitation to think about which parts of the data-center stack still belong to the general-purpose future and which belong to specialized AI workloads. Over the next five years data centers will likely split into two radically different profiles, and mixing them in the same building will stop making sense.

Was this useful?
[Total: 11 · Average: 4.5]

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.