Next-generation NPUs: the hardware moving AI in 2026

Updated: 2026-05-03

Three years ago, NPUs were a marginal checkbox on a laptop spec sheet. Today they define the real experience of running local models for transcription, summarization, image generation, or code assistance. The leap happened faster than predicted, and the 2026 hardware map looks very different from that of 2024.

Key takeaways

  • An NPU trades versatility for performance per watt: ideal for continuous inference, not training.
  • TOPS metrics mislead on their own; compiler maturity and memory bandwidth determine real performance.
  • Apple, Qualcomm, Intel, and AMD account for the four major consumer NPU ecosystems.
  • Continuous inference, models up to 13B parameters, and privacy-sensitive on-device data are the three standout use cases.
  • NVIDIA GPUs still dominate training and large models (over 30B).

What an NPU actually is

An NPU (Neural Processing Unit) is an accelerator designed specifically for the operations dominating neural-network inference: matrix-matrix multiplications, convolutions, and activations. Unlike GPUs, which are general-purpose parallel compute accelerators, NPUs trade versatility for much higher performance per watt within their narrow domain. That makes them ideal for continuous or low-latency workloads on devices with limited thermal budget.

The most widely published metric is TOPS (tera-operations per second) at INT8 or FP16 precision. It is useful as an order-of-magnitude guide but misleading on its own: it captures neither the available memory bandwidth nor the efficiency of the compiler that translates the model into accelerator instructions. In practice, a 40 TOPS chip with a mature compiler can beat a 50 TOPS chip with poor software support.
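To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes that generating each token requires reading all model weights from memory once, so memory bandwidth caps decode speed regardless of the advertised TOPS. The function name, the two chips, and their bandwidth figures are hypothetical, chosen only for illustration.

```python
def max_tokens_per_second(params_billion: float, bits_per_weight: int,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every token must read all weights once."""
    model_gb = params_billion * bits_per_weight / 8  # billions of params * bytes each = GB
    return bandwidth_gb_s / model_gb


# Hypothetical chips running the same 7B INT4 model (3.5 GB of weights).
for name, bandwidth in [("chip A, 50 TOPS", 60.0), ("chip B, 40 TOPS", 120.0)]:
    ceiling = max_tokens_per_second(7, 4, bandwidth)
    print(f"{name}: at most {ceiling:.1f} tokens/s")
```

Under these assumptions the chip with fewer TOPS but twice the bandwidth has twice the throughput ceiling, which is why the headline number alone says little.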

The other factor that matters is unified memory. NPUs that share memory with the CPU and GPU avoid costly copies when a pipeline combines multiple stages, which is increasingly common because real applications rarely consist of a single isolated neural network.

The four actors dominating 2026

The consumer landscape has consolidated into four main families, each with distinct characteristics and a clear market position.

Apple Neural Engine

Apple leads on ecosystem coherence. The Neural Engine in the M4 and M5 reaches 38–45 TOPS depending on the variant, with unified memory shared with the CPU and GPU, and a toolchain (Core ML, MLX, Metal) that lets developers move workloads between accelerators without rewriting code. Phi-3, Llama 3.2, Mistral Small, and Gemma 2 run comfortably on consumer Macs with latency under 100 ms per token.

The weakness is still versatility: the Neural Engine handles standard architectures well but is more rigid than an NVIDIA GPU when the model has uncommon operators.

Qualcomm Hexagon NPU

Qualcomm has gone from quiet mobile leader to the actor defining local AI on Windows ARM laptops with Snapdragon X Elite and X2. The Hexagon NPU reaches 45 TOPS in the current generation, and the AI Engine Direct stack integrates well with ONNX Runtime, DirectML, and the new Windows ML. In practice, for continuous workloads a Snapdragon X2 laptop runs models of 7–13 billion parameters with better battery life than an x86 laptop with a comparable discrete GPU.

Qualcomm’s challenge is software: drivers matured during 2025 but inconsistencies still appear with less popular frameworks.

Intel NPU 4 in Core Ultra

Intel made a notable leap with the NPU 4 in Core Ultra 300: from 11 TOPS in the first generation (2023) to 48 TOPS with major improvements in bandwidth and the OpenVINO compiler. Intel’s clear bet is that developers shouldn’t have to choose between CPU, integrated GPU, or NPU: the OpenVINO runtime picks the optimal route by model and thermal state.
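As a hedged illustration of leaving the choice to the runtime, OpenVINO offers an AUTO device that accepts a priority list such as "AUTO:NPU,GPU,CPU". The sketch below only builds that string from the devices present; the helper name and the model path in the comments are hypothetical, and it makes no claim about how the runtime weighs thermal state.

```python
def auto_device_string(available, priority=("NPU", "GPU", "CPU")):
    """Build an OpenVINO AUTO device string from the devices actually present."""
    # Device lists may contain suffixed names such as "GPU.0"; compare base names.
    present = {name.split(".")[0] for name in available}
    chosen = [dev for dev in priority if dev in present]
    return "AUTO:" + ",".join(chosen) if chosen else "AUTO"


print(auto_device_string(["CPU", "GPU", "NPU"]))

# With OpenVINO installed, the string would be used roughly like this
# ("model.xml" is a hypothetical path to an exported model):
#   import openvino as ov
#   core = ov.Core()
#   compiled = core.compile_model("model.xml", auto_device_string(core.available_devices))
```

Listing the NPU first expresses a preference, not a guarantee: the runtime can still fall back to the next device in the list.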

NPU 4 competes well with Apple’s Neural Engine on standard workloads, though it trails on energy efficiency during long continuous inference. For corporate environments with majority-Windows fleets, Intel is again a reasonable option.

AMD XDNA2 in Ryzen AI

AMD entered the NPU segment later, but the XDNA2 integrated in Ryzen AI 300 and 400 arrived with 50 TOPS and good ROCm and ONNX support. Unified memory between the CPU, the integrated Radeon GPU, and the NPU works well for hybrid pipelines, and the software ecosystem has matured enough to treat AMD as a viable option. AMD’s strongest point is performance per euro: in midrange laptops, Ryzen AI 350 chips offer inference capacity comparable to pricier solutions at a clearly lower final price.

Which workloads pay off on NPU

Not all AI workloads are equal. Three types are clearly best suited to an NPU:

  1. Continuous low-latency inference: voice transcription, noise cancellation in calls, real-time camera effects. These run for hours, and over that span the NPU's performance per watt far exceeds a GPU's.

  2. Small and medium models (up to 13B parameters in INT4) fitting in device memory. The NPU runs them with low latency without heating the laptop. For local assistants, translation, or short text generation, the experience is qualitatively different from sending every request to the cloud.

  3. Privacy-sensitive inference where data can’t leave the device for legal or contractual reasons. Here the NPU is a direct enabler, not just an optimization.

Where NPUs still lose: training, very large models (over 30B parameters), and workloads with non-standard operators. NVIDIA GPUs or dedicated datacenter accelerators still rule there.
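The 13B and 30B thresholds above follow from simple arithmetic on weight storage. The sketch below is an approximation that counts weights only (no KV cache, activations, or runtime overhead) and takes 1 GB as 10^9 bytes; the function name is hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB; excludes KV cache and activations."""
    return params_billion * bits_per_weight / 8


for params in (7, 13, 30):
    sizes = ", ".join(
        f"{name} {weights_gb(params, bits):.1f} GB"
        for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4))
    )
    print(f"{params}B parameters: {sizes}")
```

A 13B model in INT4 needs about 6.5 GB for weights, which fits alongside the operating system on a 16 GB laptop, while a 30B model needs about 15 GB even in INT4 and leaves little room for anything else.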

A code example

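```python
import onnxruntime as ort

# NPU-capable execution providers by vendor. ONNX Runtime only exposes the
# ones built into the installed package, so filter against what is available.
PREFERRED = [
    "CoreMLExecutionProvider",    # Apple
    "QNNExecutionProvider",       # Qualcomm
    "OpenVINOExecutionProvider",  # Intel
    "VitisAIExecutionProvider",   # AMD Ryzen AI
    "CPUExecutionProvider",       # fallback, always present
]
available = ort.get_available_providers()
providers = [p for p in PREFERRED if p in available]

# Assumes Phi-3 mini was already exported to ONNX; the path is a placeholder.
# Tokenization and the decoding loop are omitted for brevity.
session = ort.InferenceSession("phi-3-mini-4k-instruct.onnx", providers=providers)
print("Running on:", session.get_providers())
```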

The runtime picks the actual execution path. What used to demand code specific to each hardware family is starting to hide behind reasonable common abstractions.

My reading

NPUs have crossed the line between marginal novelty and decisive component. In 2026 a laptop without a competent NPU is simply an outdated laptop, not one that merely lacks an extra. For application developers, the decision is no longer whether to embed local AI but which runtime to use so the code leverages the available accelerator without vendor-specific branches.

What remains open is software consolidation. Teams shipping apps with local AI are making pragmatic decisions: use ONNX Runtime with the matching execution provider, add an Apple-specific Core ML layer, and accept that perfect portability will still take a couple of years.

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.