NPUs for developers: what you can actually do today
Updated: 2026-05-03
For a couple of years, the NPU acronym was mostly a label on laptop boxes. In 2025 that has changed enough to warrant an honest review: what hardware is available, what tools let you use it from real code, which workloads pay off and which are still better on CPU or GPU.
Key takeaways
- The three dominant laptop families are Qualcomm Snapdragon X (Elite and Plus), Apple Silicon from M1 onward, and AMD Ryzen AI 300 with XDNA.
- TOPS are an easy number to compare but a misleading one: what matters in practice is the combination of raw capacity, supported precision, memory bandwidth, and the software stack available.
- The most useful common denominator for developers wanting to cover multiple platforms is ONNX Runtime with vendor-specific execution providers.
- Best-solved use cases today are lightweight vision, audio transcription, and small language models with INT4 quantization.
- Running a model on NPU is often slower than on integrated GPU for the first invocation because of loading and compilation cost; the benefit shows up in repeated runs.
What’s on the market
The three dominant laptop families are Qualcomm Snapdragon X (Elite and Plus), Apple Silicon from M1 onward, and AMD Ryzen AI 300 with XDNA. Intel entered later with Core Ultra Meteor Lake and Lunar Lake. Headline NPU figures: 45 TOPS on Snapdragon X Elite, 38 on Apple M4, 50 on Ryzen AI 300, and 48 on Lunar Lake. TOPS are an easy number to compare but a misleading one: raw capacity says little without the supported precision, memory bandwidth, and software stack behind it.
The toolchain: ONNX Runtime as common denominator
The element that made it realistic to talk about NPUs for developers is ONNX Runtime with vendor-specific execution providers: Qualcomm's QNN EP, Apple's CoreML EP, AMD's Vitis AI EP, and Intel's OpenVINO EP. For a developer who wants to cover multiple platforms, the practical decision is to start with ONNX Runtime and let each vendor's EP target the hardware. Quantization to INT8 or INT4 is almost always required, because most NPUs are integer-oriented.
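For orientation, here is a minimal sketch of that quantization step using ONNX Runtime's own tooling; the file names are placeholders, and dynamic quantization is just the simplest entry point, since static (calibration-based) QDQ quantization usually maps better onto NPU execution providers.
# Sketch: post-training dynamic quantization with ONNX Runtime's tooling.
# File names are placeholders; static, calibrated QDQ quantization is usually
# the better fit for NPUs, but the overall workflow looks similar.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="quantized_model.onnx",
    weight_type=QuantType.QInt8,
)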
What you can do today
The best-solved use case is lightweight vision inference. Object detection, image classification, segmentation, and face recognition all run well on any current NPU, with tens-of-milliseconds latency and significantly lower energy use than on an integrated GPU.
The second mature case is audio transcription. Whisper in small and medium variants runs reasonably well on NPU after proper quantization.
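As a rough illustration of that pipeline, the sketch below exports Whisper to ONNX with Hugging Face Optimum and runs it through ONNX Runtime; the model name, audio file, and provider choice are illustrative assumptions, it requires Optimum and Transformers installed, and whether a given NPU provider accepts the whole exported graph is not guaranteed (unsupported operators fall back to CPU).
# Hedged sketch: export Whisper to ONNX with Optimum and run it via ONNX Runtime.
# Model name, audio file, and provider are illustrative assumptions.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-small", export=True, provider="QNNExecutionProvider"
)
processor = AutoProcessor.from_pretrained("openai/whisper-small")
asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
print(asr("meeting_recording.wav")["text"])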
The third case, more recent and more ambitious, is small language models. Phi-3 Mini, Llama 3.2 1B and 3B, and Qwen 2.5 variants in the few-billion-parameter range with INT4 quantization already run on current NPUs at a tokens-per-second rate useful for summarization, text correction, or local assistants. It's not the territory where a laptop NPU competes with a datacenter GPU; it's where it competes with running the same model on CPU, and there the NPU usually wins on both latency and energy.
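To get a feel for the tokens-per-second figure, a common route is the onnxruntime-genai package on top of an INT4 ONNX export; the sketch below is an approximation, since the generation-loop API has changed between releases of that package, and the model folder is a hypothetical local export.
import time
import onnxruntime_genai as og  # API details vary between onnxruntime-genai releases

model = og.Model("./phi3-mini-int4")  # hypothetical path to an INT4 ONNX export
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Summarize the following notes: ..."))

start, n_tokens = time.perf_counter(), 0
while not generator.is_done():
    generator.generate_next_token()
    n_tokens += 1

print(tokenizer.decode(generator.get_sequence(0)))
print(f"{n_tokens / (time.perf_counter() - start):.1f} tokens/s")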
Where it doesn’t pay off yet
Large models remain the territory of GPUs or of CPUs with plenty of memory. No consumer NPU does training today; they are all inference-only. Models with complex control flow that don't compile well to the static graphs NPUs expect are also better left on CPU or GPU.
A detail that surprises people approaching this for the first time: running a model on NPU is often slower than on integrated GPU for the first invocation because of loading and compilation cost. The benefit shows up in repeated runs.
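A quick way to see this on your own machine is to time the first invocation separately from the steady state; depending on the execution provider, the one-time cost lands in session creation or in the first run, so the sketch below (with a placeholder model file, on the same Snapdragon setup as the example that follows) measures both.
import time
import numpy as np
import onnxruntime as ort

providers = [("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}), "CPUExecutionProvider"]
x = {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)}

t0 = time.perf_counter()
session = ort.InferenceSession("quantized_model.onnx", providers=providers)
print(f"session creation: {(time.perf_counter() - t0) * 1e3:.0f} ms")  # compilation may happen here

t0 = time.perf_counter()
session.run(None, x)
print(f"first run: {(time.perf_counter() - t0) * 1e3:.0f} ms")  # or here, depending on the EP

warm = []
for _ in range(50):
    t0 = time.perf_counter()
    session.run(None, x)
    warm.append(time.perf_counter() - t0)
print(f"warm median: {np.median(warm) * 1e3:.1f} ms")  # the number that matters for repeated runs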
Minimal ONNX Runtime example
import onnxruntime as ort
import numpy as np
# On a Snapdragon X Elite laptop
providers = [
    ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),  # Qualcomm NPU backend
    "CPUExecutionProvider",  # fallback for operators the NPU cannot run
]
session = ort.InferenceSession("quantized_model.onnx", providers=providers)
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = session.run(None, {"input": input_data})
On Apple, switch to CoreMLExecutionProvider; on AMD, to VitisAIExecutionProvider; on Intel, to OpenVINOExecutionProvider. If something fails on the NPU, the runtime falls back to CPU automatically.
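A small helper (hypothetical, not part of ONNX Runtime itself) can make that switch automatic by asking the installed runtime which providers it actually exposes:
import onnxruntime as ort

def npu_providers():
    # Prefer whichever vendor NPU provider the installed onnxruntime build
    # exposes, and always keep CPU as the final fallback.
    preferred = [
        "QNNExecutionProvider",       # Qualcomm
        "CoreMLExecutionProvider",    # Apple
        "VitisAIExecutionProvider",   # AMD
        "OpenVINOExecutionProvider",  # Intel
    ]
    available = ort.get_available_providers()
    return [p for p in preferred if p in available] + ["CPUExecutionProvider"]

session = ort.InferenceSession("quantized_model.onnx", providers=npu_providers())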
Conclusion
What was marketing is becoming infrastructure; time to learn to use it. The correct evaluation is always on the target hardware, with your own model, and with real data; vendor benchmarks are starting points, not decisions.