A year ago, an NPU in an ordinary PC was a rarity. Today, almost any mid-range laptop you buy has one. Qualcomm shipped it first with the Snapdragon X, Intel followed with Lunar Lake, and AMD rounded things out with its Strix Point APUs. Microsoft seized the moment to create the Copilot+ PC category, which requires at least 40 TOPS of NPU performance, and that requirement has pushed the rest of the market in the same direction.
The question I want to answer here isn’t whether NPUs exist, but whether they actually change anything for people wanting to run AI models locally. I’ve spent several months testing different setups, and the answer is more nuanced than the announcements suggest.
What an NPU does well
Current NPUs are optimized for neural-network inference with quantized weights, typically INT8 or INT4. They aren’t general-purpose processors or GPU replacements: they’re accelerators specialized in low-precision matrix multiplications with minimal power draw.
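To make "quantized weights" concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It is illustrative only: real runtimes quantize per channel or per block, and in bulk, but the idea is the same mapping of floats to small integers plus a scale.

```python
# Minimal sketch of symmetric INT8 quantization, the representation
# NPUs are built to crunch. Pure Python, no frameworks; numbers chosen
# for illustration.

def quantize_int8(weights):
    """Map float weights to INT8 values plus one scale factor."""
    scale = max(abs(w) for w in weights) / 127  # single scale for the tensor
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the INT8 values."""
    return [v * scale for v in q]

weights = [0.52, -1.3, 0.07, 0.91]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

# Each recovered value is within one quantization step of the original
assert all(abs(a - w) <= scale for a, w in zip(approx, weights))
```

The NPU's win is that the multiply-accumulate work then happens on those small integers, which is exactly the operation its hardware is specialized for.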
That’s where they win. A 40 TOPS NPU can run a small language model (say, Phi-3 Mini quantized) at a perfectly usable speed while drawing 5 to 10 watts. A laptop GPU would do the same task faster, but consume four or five times more energy, with the fan spinning and the battery visibly dropping.
The second scenario where they shine is computer vision. Object detection, segmentation, face recognition on local cameras, real-time video filtering: anything that involves processing a stream of data with a relatively small model fits like a glove.
Where they fall short
It's important not to mistake the NPU for a GPU replacement on heavier workloads. A 7B-parameter model quantized to Q4 needs 4 to 5 GB of memory, and while the NPU can process it, the memory available to the NPU (shared with the system in most architectures) caps the practical model size. With 16 GB of total RAM, running a 7B model while doing anything else is uncomfortable.
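The 4 to 5 GB figure falls out of simple arithmetic. The overhead numbers below are rough assumptions (Q4 formats store a bit more than 4 bits per weight once per-block scales are counted, and the KV cache and runtime buffers vary with context length and implementation):

```python
# Why a 7B model at Q4 lands in the 4-5 GB range: rough memory math.
# Overhead figures are assumptions for illustration.

params = 7e9
bits_per_weight = 4.5   # ~4 bits plus per-block scale metadata (assumed)
weights_gb = params * bits_per_weight / 8 / 1e9

kv_cache_gb = 0.5       # KV cache for a modest context window (assumed)
runtime_gb = 0.5        # runtime buffers and activations (assumed)

total_gb = weights_gb + kv_cache_gb + runtime_gb
print(f"weights ~{weights_gb:.1f} GB, total ~{total_gb:.1f} GB")
```

On a 16 GB machine, that total has to coexist with the OS, the browser, and everything else, which is where the discomfort comes from.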
They also aren't designed for training or serious fine-tuning. When announcements talk about "AI on your PC", the assumed scenario is always inference on pre-trained models; anything involving actual training still requires a GPU, ideally with dedicated VRAM.
The state of software
This is where things are less mature. Having an NPU doesn't mean any app will use it automatically. Each vendor exposes its NPU through its own runtime: QNN on Qualcomm, OpenVINO on Intel, Ryzen AI on AMD (ROCm, often mentioned in the same breath, targets AMD's GPUs rather than the NPU). Interoperability has improved a lot with ONNX Runtime, which abstracts over all three platforms, but in practice many frameworks and apps still assume CPU or GPU and ignore the NPU entirely.
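In ONNX Runtime the abstraction takes the form of "execution providers", and targeting the NPU mostly means putting the right provider first in the preference list. Here is a sketch of that selection logic; the provider names match ONNX Runtime's identifiers, but the model path is hypothetical and the fallback policy is my own choice:

```python
# Sketch: prefer an NPU execution provider when ONNX Runtime reports
# one as available, always keeping CPU as the fallback.

NPU_PROVIDERS = [
    "QNNExecutionProvider",       # Qualcomm
    "OpenVINOExecutionProvider",  # Intel
    "VitisAIExecutionProvider",   # AMD Ryzen AI
]

def pick_providers(available):
    """Order providers: any NPU backends found first, CPU last."""
    chosen = [p for p in NPU_PROVIDERS if p in available]
    return chosen + ["CPUExecutionProvider"]

# With onnxruntime installed, the hookup would look like:
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)

print(pick_providers(["QNNExecutionProvider", "CPUExecutionProvider"]))
```

The point of the explicit list is that nothing happens by default: an app that never asks for the NPU provider will silently run on CPU, which is exactly the ecosystem gap described above.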
Some products are already optimized: Windows Copilot features (camera effects, local translation, automatic captions) use the NPU on Copilot+ PCs, and so do the models Microsoft packages into Windows for content summarization. Outside that circle, the ecosystem lags.
To run an LLM locally, the most practical options today are LM Studio (which has started to support Qualcomm NPUs via QNN) and going straight to ONNX Runtime with NPU-optimized models. Ollama, which is what most of us probably use, still doesn't take advantage of NPUs in most setups; it runs on CPU or GPU.
When a PC with an NPU is worth picking
My practical recommendation is this: if you buy a new laptop in 2025, almost any mid-range or high-end machine will ship with an NPU, so the real question is how much weight to give it. If you'll run local AI models routinely (an always-on assistant, real-time video processing, audio transcription), the NPU adds value: it extends battery life and lets you keep the feature on without a noticeable impact.
If your local-AI usage is occasional (load a model, run some tests, generate some text), a reasonable GPU is still more versatile: you'll be able to run larger models and you'll have access to more tooling. The NPU doesn't hurt, but it doesn't change your life either.
Where I would choose deliberately is in small, silent machines: ultrabooks, mini PCs, tablets. There the NPU makes a real difference. A Snapdragon X Elite can run a decent Phi-3 without spinning a fan, and no integrated GPU matches that.
Looking a bit ahead
What I think will happen over the next two years is that the ecosystem will level out. Runtimes will converge, popular frameworks will start exposing the NPU as a default option, and small models designed with NPUs in mind (like the Phi family or quantized Gemma variants) will become the natural use case. The transition echoes GPUs in scientific computing fifteen years ago: early on they needed special compilers and rewritten code; today everyone uses them without a second thought.
In the meantime, if your machine already has an NPU, it's worth knowing it's there and starting to explore what it enables. It doesn't change the game yet, but the hardware is arriving ahead of the software that will eventually use it. Getting acquainted with it today is a reasonable afternoon's investment, and it keeps you from discovering, a year from now, that you've been ignoring an accelerator you already had at home.