llama.cpp: Optimisations That Keep Surprising

llama.cpp is the C++ library that powers Ollama and much of the local-LLM ecosystem. In 2024 it gained speculative decoding with two- to three-fold speedups, an RPC server for sharding layers across machines, and a stable GGUF format. Ollama covers about 90% of use cases; using llama.cpp directly pays off when you run uncommon hardware or need specific flags.

December 1, 2024, 6 min read