NPU in the PC: faster, cheaper local AI
Updated: 2026-05-03
A couple of years ago, talking about an NPU in a normal PC was a rarity. Today, if you buy a mid-range laptop, it almost certainly has one. Qualcomm got there first with the Snapdragon X, Intel followed with Lunar Lake, and AMD closed the loop with its Strix Point APUs. Microsoft used the moment to create the Copilot+ PC category, which requires at least 40 TOPS of NPU performance and has pushed the rest of the market in that direction.
The question I want to answer here isn’t whether NPUs exist, but whether they actually change anything for people wanting to run AI models locally. I’ve spent several months testing different setups, and the answer is more nuanced than the announcements suggest.
Key takeaways
- Current NPUs are optimised for inference with quantised weights in INT8 or INT4 — they are not general-purpose GPUs.
- A 40 TOPS NPU can run quantised Phi-3 Mini consuming 5-10 W, vs 40-50 W for a laptop GPU doing the same task.
- The software ecosystem is the weak point: QNN, OpenVINO and Ryzen AI are different runtimes, each requiring specifically optimised models.
- Ollama still doesn’t leverage NPUs in most configurations; it goes through CPU or GPU.
- Where the NPU makes a real difference is in ultrabooks and silent mini PCs: continuous inference without a fan.
What an NPU does well
Current NPUs are optimized for neural-network inference with quantized weights, typically INT8 or INT4. They aren’t general-purpose processors or GPU replacements: they’re accelerators specialized in low-precision matrix multiplications with minimal power draw.
That’s where they win. A 40 TOPS NPU can run a small language model — say, Phi-3 Mini quantized — at a perfectly usable speed while drawing 5 to 10 watts. A laptop GPU would do the same task faster, but consume four or five times more energy, with the fan spinning and the battery visibly dropping.
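To be concrete about what “quantized” means here, a minimal sketch of symmetric INT8 weight quantization (illustrative only, using NumPy rather than any NPU runtime):

```python
import numpy as np

# Symmetric INT8 quantization: store int8 weights plus one float scale per tensor,
# and reconstruct an approximation of the original weights at inference time.
w = np.random.randn(4, 4).astype(np.float32)
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)      # what the accelerator actually stores and multiplies
w_restored = w_int8.astype(np.float32) * scale    # approximate reconstruction
print(np.abs(w - w_restored).max())               # small per-weight error
```

That is the trade the NPU is built around: slightly lossy weights in exchange for much cheaper multiply-accumulate operations.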
The second scenario where they shine is computer vision: object detection, segmentation, face recognition on local cameras, real-time video filtering.
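As a rough illustration of that scenario, this is what targeting an Intel NPU through OpenVINO looks like; the model file is a placeholder for whatever detection network you have exported, and the device list depends on your drivers and OpenVINO version:

```python
from openvino import Core  # OpenVINO 2024+ Python API

core = Core()
# On a Lunar Lake machine this typically includes 'NPU' alongside 'CPU' and 'GPU'
print(core.available_devices)

# Compile a detection model directly for the NPU.
# "yolov8n.xml" is a placeholder for an OpenVINO IR model on disk.
model = core.read_model("yolov8n.xml")
compiled = core.compile_model(model, device_name="NPU")
```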
Where they fall short
It’s important not to treat an NPU as a GPU replacement for heavier workloads. A 7B-parameter model quantized to Q4 takes 4 to 5 GB of memory, and while the NPU can handle the compute, the memory available to it (shared with the system in most architectures) limits the practical model size. With 16 GB of total RAM, running a 7B model while working on anything else is uncomfortable.
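A quick back-of-the-envelope check of that figure (the overhead factor is my own rough assumption for quantization scales and tensors kept at higher precision):

```python
# Rough memory estimate for a Q4-quantized 7B model
params = 7e9
bytes_per_param = 0.5    # 4-bit weights
overhead = 1.15          # assumed: quantization scales, higher-precision layers, runtime buffers
print(f"~{params * bytes_per_param * overhead / 2**30:.1f} GiB")  # ≈ 3.7 GiB before context
```

Add the KV cache for a real context window and you land in the 4-to-5 GB range, which is why 16 GB machines feel tight.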
They also aren’t designed for training or for serious fine-tuning. When announcements talk about “AI on your PC”, the assumed scenario is always inference on pre-trained models.
The state of software
This is where things are less mature. Having an NPU doesn’t mean any app will use it automatically. Each vendor exposes its NPU through its own runtime:
- QNN on Qualcomm.
- OpenVINO on Intel.
- Ryzen AI on AMD.
Interoperability has improved a lot with ONNX Runtime, which abstracts over all three platforms through its execution providers, but the reality is that many frameworks and apps still assume CPU or GPU and ignore the NPU entirely.
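In practice, abstracting the three platforms means choosing an execution provider when you create a session. A minimal sketch, assuming an onnxruntime build that ships the relevant provider and an ONNX model on disk ("model.onnx" is a placeholder):

```python
import onnxruntime as ort

# Which execution providers this particular onnxruntime build exposes
available = ort.get_available_providers()
print(available)

# Prefer an NPU provider when present, fall back to CPU.
# "QNNExecutionProvider" (Qualcomm) and "OpenVINOExecutionProvider" (Intel)
# only appear in the vendor-specific builds/packages.
preferred = ["QNNExecutionProvider", "OpenVINOExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in available]

session = ort.InferenceSession("model.onnx", providers=providers)
```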
To run an LLM locally, the most practical options today are:
- LM Studio: has started supporting Qualcomm NPUs via QNN.
- ONNX Runtime with optimized models: a direct but more technical route (sketched after this list).
- Ollama: still doesn’t leverage NPUs in most setups; goes through CPU or GPU.
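For a sense of what that more technical route involves, here is a sketch using the onnxruntime-genai package. The API has shifted between releases and the model folder is a placeholder for an INT4 ONNX export of Phi-3 Mini, so treat it as orientation rather than something to copy verbatim:

```python
import onnxruntime_genai as og

# Placeholder: a folder containing an ONNX-exported, INT4-quantized Phi-3 Mini
model = og.Model("phi3-mini-4k-instruct-int4")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain in one paragraph what an NPU is good for."))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```

Whether this actually lands on the NPU still depends on the execution provider the package was built against, which is exactly the immaturity described above.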
When a PC with an NPU is worth choosing
My practical recommendation: if you buy a new laptop, almost any mid- or high-range machine will ship with an NPU, so the real question is how much weight to give it.
If you’ll be using local AI models routinely — assistants always running, real-time video processing, audio transcription — the NPU adds value: it extends battery life and lets you keep those features on without noticeable impact.
If your local-AI usage is occasional — load a model, run tests, generate some text — a reasonable GPU is still more versatile.
Where I would pick carefully is in small, silent machines: ultrabooks, mini PCs, tablets. There, the NPU makes a real difference. A Snapdragon X Elite can run a decent Phi-3 without a fan, and no integrated GPU matches that.
Looking a bit ahead
What I think will happen over the next two years is that the ecosystem will level out. Runtimes will converge, popular frameworks will start exposing the NPU as a default option, and small models specifically designed for NPUs will become the natural use case. The transition echoes GPUs for scientific computing fifteen years ago: early on they needed special compilers and rewritten code; today anyone uses them without thinking.
Conclusion
NPUs in consumer PCs are hardware ahead of their software, but that gap is closing. If you already have one in your machine, it’s worth knowing it’s there and starting to try what it enables. For new purchases in 2025, the NPU isn’t the deciding factor in most cases, but in silent ultrabooks or mini PCs where energy efficiency is critical, it does tip the balance. Spending an afternoon getting to know it today is a reasonable investment, and it keeps you from discovering a year from now that you’ve been ignoring an accelerator you already had at home.