llama.cpp: Optimisations That Keep Surprising

Precise clockwork gears representing finely tuned internal mechanics

The llama.cpp project, created by Georgi Gerganov, has become the foundation on which nearly the entire local-LLM ecosystem rests. Ollama, LM Studio, Jan, Msty and dozens of lesser-known tools are, underneath, convenient wrappers around this C++ library. 2024 moved at a breathless pace: speculative decoding, distributed inference, new GPU backends and a finally stable GGUF format. It is worth looking carefully at what has changed and why understanding the tool directly still pays off, even when most day-to-day traffic flows through Ollama.

What It Is and Why It Matters

llama.cpp is an inference library written in plain C++, with almost no external dependencies, that compiles anywhere a decent toolchain exists. It ships a command-line interface, an OpenAI-compatible server and support for the major accelerators on the market. Its native format, GGUF, has become the canonical way to ship quantised weights: a single self-contained file holding model metadata, tokeniser and reduced-precision tensors.

The reason for its success is simple. Against vLLM, which shines in production with multi-GPU servers and aggressive batching, llama.cpp focuses on the individual case: one user, one machine, ideally free of Python dependencies. Against Ollama, which prioritises experience over control, llama.cpp exposes every knob and dial. And against platform-bound solutions like Apple's MLX, which runs only on Apple Silicon, it keeps portability as a non-negotiable principle. On a Raspberry Pi 5 it boots with small models; on a Mac Studio with 128 GB of unified memory it runs a quantised 70B without breaking a sweat.

The 2024 Headliners

Speculative decoding is probably the year’s most tangible improvement. A small, fast draft model generates several tokens ahead; the large one verifies them in parallel and accepts those that match its own prediction. When the draft is right, the useful tokens per large-model pass multiply, with two- to three-fold speedups at no quality cost. It requires picking the pair wisely — usually a model from the same family but much lighter — but when it lands the effect is noticeable.
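The draft-and-verify loop can be sketched in a few lines. The sketch below is a toy simulation, not llama.cpp's actual implementation: the "models" are deterministic functions over integer tokens, chosen only to show how a round accepts the longest agreeing prefix plus one token from the large model.

```python
def speculative_step(target, draft, prefix, k=4):
    """One round of speculative decoding (toy sketch).

    The draft model proposes k tokens greedily; the target model then
    checks each position and keeps the longest agreeing prefix, plus
    one token of its own, so at least one token is always produced.
    """
    # Draft proposes k tokens ahead.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Target verifies all k positions. In a real engine this is a single
    # batched forward pass, which is where the speedup comes from.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expected = target(ctx)
        if expected != t:
            accepted.append(expected)  # target's own token replaces the miss
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target(ctx))  # all k matched: bonus token for free
    return accepted

# Toy "models": the draft agrees with the target except at some contexts.
target = lambda ctx: (sum(ctx) + 1) % 7
draft = lambda ctx: target(ctx) if len(ctx) % 5 else (target(ctx) + 1) % 7

out = speculative_step(target, draft, [1, 2, 3], k=4)
print(out)
```

When the draft agrees, several tokens come out of one verification pass; when it diverges, the round still yields one correct token, which is why quality is unaffected.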

The second addition is the RPC server, which shards a model’s layers across several networked machines. It does not replace a real multi-GPU system, but it does turn three modest laptops into a platform capable of running models none of them would swallow alone. It is the artisanal version of sharding, aimed at home labs rather than production.

Finally, 2024 cemented tool use with compatible models (Llama 3.1 onwards): GBNF grammars to force valid outputs, Jinja templates and structured JSON generation with syntactic guarantees, similar to what Outlines or Guidance offer.

Backends and Compilation

The backend matrix is surprisingly complete today. On CPU it exploits AVX2, AVX-512 and NEON depending on the architecture; for GPU there are dedicated paths for CUDA on NVIDIA, Metal on Apple Silicon, ROCm on AMD, SYCL on Intel graphics and Vulkan or OpenCL as cross-platform alternatives. Enabling each backend is a one-variable compilation step (LLAMA_METAL=1, LLAMA_CUDA=1, LLAMA_VULKAN=1 and friends), and the choice depends on the goal: CUDA delivers the highest raw throughput, Metal wins on efficiency, Vulkan buys cross-vendor reach and CPU remains the last refuge when the hardware is unhelpful.

Quantisation and the GGUF Format

Quantisation is where llama.cpp distances itself most from generic solutions. GGUF covers the full spectrum, from F32 and F16 down to Q2_K, but the useful range narrows to three anchors. Q8_0 is effectively indistinguishable from the original and serves as a baseline when measuring quality loss. Q5_K_M offers a balance hard to beat: one or two percent behind on real benchmarks, roughly half the size of Q8 and enough headroom to run a 13B on a decent laptop. Q4_K_M is the sweet spot for large models: a 70B fits comfortably in 48 GB of VRAM or a Mac with generous unified memory, and the degradation is acceptable for most tasks.
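The sizes above follow from simple arithmetic on bits per weight. The bpw figures below are approximate averages for each scheme (the K-quants mix precisions across tensors), so take the results as back-of-envelope estimates rather than exact file sizes.

```python
# Approximate average bits per weight for common GGUF schemes.
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Rough GGUF file size in GB for a given parameter count."""
    bits = params_billions * 1e9 * BPW[quant]
    return bits / 8 / 1e9  # bits -> bytes -> GB

for quant in BPW:
    print(f"70B @ {quant}: ~{gguf_size_gb(70, quant):.0f} GB")
```

For a 70B model this lands Q4_K_M at roughly 42 GB, which is exactly why it fits in 48 GB of VRAM with a little room left for context.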

Below that, Q3_K_M and Q2_K start to show the strain: fragile reasoning, more hallucinations, subtle errors. The newer IQ-family quants, such as IQ4_NL and IQ3_XXS, apply an importance-based strategy and at equal size usually outperform their classic counterparts.

Server Mode and Bindings

llama-server spins up an OpenAI-compatible endpoint, which means any GPT client can be pointed at it without changing a line of code. This is what Ollama packages, but reaching the same point yourself takes a single command and unlocks settings Ollama hides: custom context sizes, specific sampling parameters, explicit layer offloading with -ngl or on-the-fly LoRA adapter application.

./llama-server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 99
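Talking to that server needs nothing beyond the standard library. The sketch below assumes a llama-server running on localhost:8080 exposing the OpenAI-style /v1/chat/completions route; the "model" field is a placeholder, since the server answers for whatever GGUF it loaded.

```python
import json
from urllib import request

def chat_body(prompt: str, temperature: float = 0.7) -> dict:
    """OpenAI-style request body understood by llama-server."""
    return {
        "model": "local",  # placeholder; the server uses the loaded GGUF
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    req = request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(chat_body(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires a running llama-server
        return json.load(resp)["choices"][0]["message"]["content"]

print(chat_body("Hello")["messages"][0]["role"])
```

Because the wire format matches OpenAI's, swapping the base URL is all it takes to point an existing GPT client at local hardware.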

For those who prefer Python, the llama-cpp-python package wraps the same library with an idiomatic interface and its own server. It is the usual route for integrating it into LangChain or LlamaIndex pipelines, or any application already fluent in Python.

Apple Silicon and the Mac Case

Macs have become the reference hardware for large local models. Unified memory eliminates the classic bottleneck between system RAM and VRAM: an M3 Max with 128 GB can load a 70B quantised to Q4 and reserve generous context without shuffling across buses. The Metal backend is polished enough that, on efficiency per watt, it beats many mid-range NVIDIA setups. It does not use the Neural Engine; inference runs on the GPU, sharing the SoC's memory bandwidth, and the result is 60-80 tokens per second on Llama 3 8B Q4, against 150+ on an RTX 4090, at a fraction of the power draw.
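Those throughput figures are no accident: decoding large models is mostly memory-bandwidth bound, since every generated token streams the full set of weights through the chip once. Dividing bandwidth by model size gives a rough ceiling; the numbers below are approximate public specs, not measurements.

```python
# Rough decode-speed ceiling: bandwidth / bytes read per token.
def tokens_per_second_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    return bandwidth_gbs / model_gb

# 8B model at Q4 is about 4.9 GB of weights.
m3_max = tokens_per_second_ceiling(400, 4.9)    # M3 Max: ~400 GB/s unified memory
rtx4090 = tokens_per_second_ceiling(1008, 4.9)  # RTX 4090: ~1 TB/s GDDR6X
print(f"M3 Max ceiling: ~{m3_max:.0f} t/s, RTX 4090 ceiling: ~{rtx4090:.0f} t/s")
```

The ceilings (~82 and ~206 t/s) sit just above the observed 60-80 and 150+ figures, which is consistent with a bandwidth-bound workload plus some compute overhead.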

When to Go Direct and When Ollama Suffices

Ollama is enough for 90% of cases: it downloads, packages, manages models and exposes a clean API. Calling llama.cpp directly pays off when you need to squeeze uncommon hardware, embed the binary in a Python-free application, experiment with sampling flags or run a feature weeks before Ollama adopts it. Also when building services that need a single static binary in offline environments, where any extra layer is a liability.

For production with concurrent users, however, the honest recommendation is vLLM: llama.cpp is optimised for a single inference flow at a time, and multi-GPU parallelism remains clunkier than in purpose-built alternatives.

Conclusion

The fascinating thing about llama.cpp is that, while being the invisible engine behind nearly everything, it keeps an iteration speed no wrapper can match. Commits land daily, backends are overhauled in weeks and GGUF has survived the pressure of being both the ecosystem's lingua franca and a testbed for new quantisation ideas.

That tension between stability and the cutting edge is what makes it valuable. Ollama and LM Studio exist because most users neither want nor need to deal with compile flags or pick between seven Q4 variants, but their existence does not make the engine underneath expendable. Quite the opposite: the more mature the wrapper, the clearer it becomes that any serious improvement in local inference passes through Gerganov's repository first.

The sensible stance is to treat it the way one treats a compiler or a kernel: no need to rebuild it every week, but worth understanding what it offers and how it behaves under load. When the day comes that Ollama does not support the backend you need, the model you want to try or the flag you care about, knowing how to drop one layer down stops being optional. And that day arrives sooner than expected.
