For a couple of years, the NPU acronym was mostly a label on laptop boxes and a checkbox on processor specs. In 2025 that has changed enough to warrant an honest review: what hardware is available, what tools let you use it from real code, which workloads pay off and which are still better on CPU or GPU. The landscape is neither uniform nor finished, but there's enough for a developer to judge whether integrating an NPU into a specific product is worth the time.
What’s on the market
The three dominant laptop families are Qualcomm Snapdragon X (Elite and Plus), Apple Silicon from M1 onward, and AMD Ryzen AI 300 with XDNA. Intel entered later with Core Ultra Meteor Lake and Lunar Lake, which include their own NPU but with a less-polished software ecosystem. The numbers vendors advertise revolve around TOPS, operations per second at low precision, and reach 45 on Snapdragon X Elite, 38 on Apple M4, 50 on Ryzen AI 300, and 48 on Lunar Lake.
TOPS are an easy number to compare but a misleading one. What matters in practice is the combination of raw capacity, supported precision, memory bandwidth, and the software stack available to reach the silicon. A chip with 45 theoretical TOPS and immature tooling delivers less real inference than one with 30 TOPS and a polished toolchain. Servers and workstations also have NPUs in some systems, but the natural focus in 2025 is the laptop, because that’s where the use case is clearest and where thermal and battery limits matter.
The toolchain: ONNX Runtime as common denominator
The element that made it realistic to talk about NPUs for developers is ONNX Runtime with its vendor-specific execution providers. Qualcomm has QNN EP; Apple has CoreML EP; AMD has Vitis AI EP; Intel has OpenVINO EP. They all follow the same pattern: take an ONNX model and dispatch part of the graph to the NPU, leaving the rest on CPU or GPU. Support isn’t uniform and some operators don’t translate, but for common vision and language-processing models, coverage is sufficient.
Each vendor also has its own chain. Apple offers Core ML with the coremltools compiler, which converts models from PyTorch or ONNX and produces native packages. AMD has Ryzen AI Software with a Vitis AI-based flow that quantizes models to INT8 and compiles them for the NPU. Qualcomm provides the AI Engine Direct SDK with conversion utilities to its QNN binary format. Intel pushes OpenVINO, which besides its NPU supports CPU and integrated GPU through the same API.
The practical decision for a developer wanting to cover multiple platforms is to start with ONNX Runtime. A well-exported ONNX model can run on CPU, GPU and the four main NPUs with minimal inference-code changes. Quantization to INT8 or even lower is almost always required: most NPUs are integer-oriented and getting the most out of them requires reducing model precision at export, not at load.
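The quantization step is where most of the work lives, and it helps to see what the tools are actually doing to the weights. A toy sketch of the affine INT8 mapping (simplified: real toolchains quantize per channel and calibrate on representative data, but the arithmetic is the same idea):

```python
# Toy sketch of affine INT8 quantization: each float is stored as an int8
# code plus a shared scale and zero point. Real toolchains do this per
# channel with calibration data, but the mapping is the same.
def quantize_int8(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0           # width of one int8 step
    zero_point = round(-lo / scale) - 128    # int8 code that maps back to 0.0
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize_int8(codes, scale, zero_point):
    return [(c - zero_point) * scale for c in codes]

weights = [0.42, -1.3, 0.0, 2.7, -0.05]
codes, scale, zp = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zp)
# The round trip stays within one quantization step of the original values.
assert all(abs(w - r) <= scale for w, r in zip(weights, restored))
```

In a real project you would run torch.onnx.export and then ONNX Runtime's quantization utilities rather than hand-rolling this; the sketch only illustrates why precision is a decision made at export time, not at load time.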
What you can do today
The best-solved use case today is lightweight vision inference. Object detection, image classification, segmentation, and face recognition all run well on any current NPU, with latencies of tens of milliseconds and significantly lower energy than on the integrated GPU. For desktop apps that process real-time camera video or analyze user images, the NPU is today's natural choice.
The second mature case is audio transcription. Whisper in its small and medium variants runs reasonably well on NPU after proper quantization, and apps like live captions or voice notes benefit a lot from reduced energy cost versus running the model on CPU or GPU. Apple has very polished Whisper support on the Neural Engine via Core ML; other vendors have caught up during 2025 with varying quality.
The third case, more recent and more ambitious, is small language models. Phi-3 Mini, Llama 3.2 1B and 3B, Qwen 2.5 in the few-billion-parameter range with INT4 quantization already run on current NPUs at a tokens-per-second rate that’s starting to be useful for summarization, text correction or local assistants. It’s not the territory where a laptop NPU competes with a datacenter GPU; it’s the territory where it competes with running the same model on CPU, and there the NPU usually wins clearly in both latency and energy.
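A back-of-envelope calculation shows why this size range is the sweet spot: at INT4, weight memory alone puts few-billion-parameter models comfortably within NPU-accessible memory, while larger models are a different story. (The figures below count weights only; activations and KV cache add more on top.)

```python
# Approximate weight memory for a model at a given precision.
# Weights only -- activations and KV cache are not counted.
def weight_gib(params_billions, bits_per_weight):
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for name, billions in [("1B", 1.0), ("3B", 3.0), ("13B", 13.0)]:
    print(f"{name}: ~{weight_gib(billions, 4):.1f} GiB at INT4, "
          f"~{weight_gib(billions, 8):.1f} GiB at INT8")
```

A 3B model at INT4 needs under 2 GiB for weights; a 13B model needs several times that even before runtime overhead, which is why the next section draws the line where it does.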
The fourth and most speculative case is small image-generation models. Stable Diffusion in distilled variants, like the Turbo or Lightning models, works on decent NPUs with per-image generation times of a few seconds at moderate sizes. Quality doesn't match a dedicated GPU, but for personal use or app integration, the quality-per-energy ratio is getting interesting.
Where it doesn’t pay off yet
Not everything is favorable ground. Large models remain the territory of GPUs or CPUs with abundant memory. A 13-billion-parameter model doesn't fit in the memory accessible to a laptop NPU, or fits only heavily quantized with degraded quality, and the integrated GPU with access to unified memory usually wins. The same applies to large diffusion models, training tasks (no consumer NPU trains today; they are all inference-only), and workloads with complex control flow that don't compile well to the static graph NPUs expect.
Neither does it pay off when inference happens on a server and the client only makes HTTP requests. There the client hardware is irrelevant and the question doesn't arise. The NPU's ground is local inference; if your architecture doesn't include local execution, the NPU is a non-issue.
A detail that surprises people approaching this for the first time is that running a model on NPU is often slower than on integrated GPU for the first invocation, because of loading and compilation cost. The benefit shows up in repeated runs or in long-running scenarios, where energy efficiency offsets the initial latency. This must be factored into the app experience design.
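A simple way to account for this is to measure first-call latency separately from steady state. A small, framework-agnostic harness (pass it any callable that runs one inference):

```python
import time

def measure(infer, runs=20):
    """Return (first_call_seconds, steady_state_seconds_per_run)."""
    start = time.perf_counter()
    infer()                               # pays model loading / graph compilation
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(runs):                 # amortized cost on the compiled path
        infer()
    steady = (time.perf_counter() - start) / runs
    return first, steady
```

Called with something like `lambda: session.run(None, feeds)`, this makes the compilation cost visible so the app can warm the model up off the critical path, at startup or behind a splash screen, instead of charging it to the first user action.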
Minimal ONNX Runtime example
For a developer who wants to try this today, the short path is to export a model to ONNX from PyTorch, quantize it, and load it with the corresponding execution provider. The Python code looks quite similar across platforms once the environment is ready.
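A sketch of that code on a Snapdragon machine (the model path and input shape are placeholders, and it assumes an onnxruntime build with the QNN EP installed):

```python
PROVIDERS = ["QNNExecutionProvider",   # Qualcomm NPU first...
             "CPUExecutionProvider"]   # ...with CPU as automatic fallback

def run_model(model_path="model_int8.onnx", shape=(1, 3, 224, 224)):
    import numpy as np
    import onnxruntime as ort

    # The session partitions the graph: nodes the NPU provider cannot
    # handle are assigned to the CPU provider instead of failing.
    session = ort.InferenceSession(model_path, providers=PROVIDERS)
    input_name = session.get_inputs()[0].name
    x = np.random.rand(*shape).astype(np.float32)
    return session.run(None, {input_name: x})[0]
```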
On Apple you switch the provider to CoreMLExecutionProvider, on AMD to VitisAIExecutionProvider, on Intel to OpenVINOExecutionProvider. The idea is that the same model and nearly the same code run on all four, and that if something fails on the NPU the runtime falls back to CPU automatically. Reality has more corners, but the abstraction is a useful starting point.
My reading
After following this space for two years, I think laptop NPUs are today a real tool but not a magic bullet. For specific cases (lightweight vision, audio, small language models) they clearly win on latency and energy. For large models, they don't. The toolchain has matured enough that a developer with inference experience can integrate an NPU into a product in weeks, not months, as long as they accept the limits of quantization and spend time on the specifics of their target vendor.
The most common blind spot I see in teams is assuming TOPS numbers translate directly into performance. They don’t. What translates into performance is the match between model, quantization, supported operator graph and available memory bandwidth. A team that measures on their own use case, on the target hardware, with real data, quickly discovers which of the four platforms suits them and, above all, whether the NPU suits them over the integrated GPU, which in many cases is still more predictable and more flexible.
The direction of travel, however, is clear: each laptop generation reduces the cost of local inference, small models gain capability, and the NPU is consolidating as the natural place to run them. Aiming there in 2026 is a reasonable bet for products that want to run offline or with low perceived latency. What was marketing is becoming infrastructure; time to learn to use it.