How to Install Ollama to Run LLMs on Your Computer
Table of contents
- Key takeaways
- Why local inference became practical
- What you can and can’t do, honestly
- Installation on macOS, Linux, and Windows
- Models worth trying
- Hardware, without the mythology
- Where it goes next
- Conclusion
- Frequently asked questions
- What are the minimum requirements to run Ollama?
- Can I run Ollama without a GPU?
- How do I update Ollama to the latest version?
- Where are Ollama downloaded models stored?
Updated: 2026-05-03
Ollama[1] is the least painful way to run a large language model on your own computer. It essentially wraps llama.cpp[2] in a docker run-style UX: one binary, one command, and a quantised model downloaded to disk. Until recently, setting this up by hand meant juggling CPU feature flags, hunting down leaked weights, and reconciling file formats that shifted every couple of weeks.
Key takeaways
- Ollama is available for macOS (Apple Silicon and x86), Linux, and Windows (via Docker or WSL2).
- Llama 2 was the first model with a clear commercial licence; previously the weights were leaked material with murky legal status.
- On macOS with 16 GB unified memory, a 7B model runs fluidly without a dedicated GPU.
- The REST API exposes an OpenAI-compatible endpoint: swapping api.openai.com for localhost:11434 is trivial.
- It doesn’t replace frontier models for complex reasoning, non-trivial code, or maths; it is competent for summarisation, rewriting, RAG, and offline chat.
Why local inference became practical
In early 2023 the original LLaMA weights leaked, and within days the community showed that a 7B model could run on a laptop with 4-bit quantisation. llama.cpp was born out of that. But the legal status of those weights was murky, and every tutorial started with “first, get the torrent.”
On 18 July 2023 Meta released Llama 2 under a licence allowing commercial use, and the question shifted from “can I download this?” to “how do I run it well?” Ollama arrived precisely when there were clean weights, a stabilising file format, and enough kernel-level optimisation to make a 16 GB M2 a viable inference platform.
There’s also an economic motivation: the OpenAI API bill starts to sting when prototyping. A script evaluating 10,000 prompts against GPT-3.5 costs real money; against a local Llama 2 7B it costs electricity.
What you can and can’t do, honestly
Available open models cannot replace GPT-4. The gap in complex reasoning, long-instruction following, and non-trivial code is substantial. You will not solve competition mathematics or build reliable tool-using agents with multi-step loops.
What does work reasonably well:
- Summarising a document that fits in context.
- Rewriting and translating text.
- Generating boilerplate code.
- Answering simple factual questions.
- Acting as an offline chat assistant.
- Feeding RAG pipelines where retriever quality matters more than generator quality.
Mistral 7B and Llama 2 13B are surprisingly capable at these tasks, and they do it without sending a single byte to someone else’s server.
Installation on macOS, Linux, and Windows
macOS with Apple Silicon is where Ollama shines most: unified memory lets you load 13B models without a dedicated card and without paging. Install with the one-line script or by downloading the .dmg, which installs a menu-bar app.
Linux: the same script detects the distribution (Ubuntu, Debian, Fedora, Arch), downloads the binary, creates an ollama system user, and starts a systemd service. If an NVIDIA GPU with drivers and CUDA is already present, it’s detected and used automatically.
Windows: still no native installer. The clean path is WSL2 (Ubuntu inside Windows with access to the host’s NVIDIA GPU); the frictionless alternative is the official Docker image exposing port 11434.
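For the Docker route, a sketch based on the image’s published usage looks like this; the --gpus=all flag assumes the NVIDIA Container Toolkit is installed, and can be dropped for CPU-only use:

# Official ollama/ollama image; drop --gpus=all for CPU-only machines
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Once the container is up, models are run inside it, e.g. docker exec -it ollama ollama run llama2.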
# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2

The first run downloads the quantised weights (around 4 GB for Llama 2 7B in 4-bit), caches them under ~/.ollama/models, and opens an interactive chat. Switching models is as simple as ollama run mistral or ollama run llama2:13b. List what you have with ollama list; delete with ollama rm.
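A typical housekeeping session with those commands (model names are the ones mentioned above):

ollama run mistral       # switch model; downloads it on first use
ollama list              # show everything cached under ~/.ollama/models
ollama rm llama2:13b     # delete a model to free disk space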
The service starts a daemon at localhost:11434 with its own REST API and an OpenAI-compatible endpoint, letting you point the Python openai library or LangChain at the local server by changing only the base URL.
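A quick smoke test of the native REST API from the command line (the prompt is arbitrary; "stream": false returns a single JSON object instead of a token stream):

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'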

Models worth trying
The official library hosts several dozen variants. Reasonable picks:
- llama2 (7B, 3.8 GB): the workhorse; runs on any machine with 8 GB.
- llama2:13b: if you have 16 GB RAM or more and want better coherence.
- Instruction-tuned derivatives like WizardLM variants: improve instruction following over the original base.
- The 70b exists but needs at least 48 GB of RAM; more a curiosity than a practical tool.
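To fetch any of these ahead of time rather than on first run (pull downloads without opening a chat):

ollama pull llama2        # 7B, ~3.8 GB download
ollama pull llama2:13b    # only worth it with 16 GB+ of RAM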
Don’t chase every release: the ecosystem ships variants weekly and most are marginal iterations over the same base models.
Hardware, without the mythology
As a rough guide:
- 8 GB RAM: a quantised 7B runs, but slowly; the machine is visibly strained.
- 16 GB: a 7B is fluent and a 13B is usable.
- 32 GB: comfortable territory for 13B and experimenting with 34B.
- 64 GB or GPU with lots of VRAM: needed for 70B.
An NVIDIA card with 8 GB or more accelerates inference by a factor of 5-10 over pure CPU. On Mac, all RAM counts as effective VRAM, which is why a 32 GB MacBook Pro is currently one of the best inference machines per euro spent.
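As a sanity check on those tiers, a back-of-the-envelope estimate (an approximation, not an exact formula): a quantised model needs roughly parameters × bits-per-weight / 8 bytes for weights alone, so a 7B model at 4-bit is about 7e9 × 4 / 8 ≈ 3.5 GB, plus KV cache and runtime overhead. That is why 8 GB is the floor for 7B, 16 GB for 13B, and 48 GB or more for 70B.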
Where it goes next
Ollama is the first rung. For serious use, combine it with:
- A UI like Open WebUI[3] for chat with history.
- An editor plugin like Continue.dev[4] for Copilot-style autocomplete.
- A RAG stack on LangChain for querying your own documents.
OpenAI API compatibility means swapping api.openai.com for localhost:11434 in existing applications is almost trivial — which changes the economics of every prototype.
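For example, a chat-completions request that would normally go to OpenAI can be pointed at the local daemon unchanged; the path below is Ollama’s OpenAI-compatible route, and the model name assumes llama2 is already pulled:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'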
Also see generative AI regulation for the compliance context that accompanies proprietary models, and OpenAI’s code-interpreter as a comparison point with cloud models.
Conclusion
Local inference has gone from academic exercise to legitimate engineering option. It doesn’t replace frontier models, but it opens a parallel lane where privacy, zero marginal cost, and zero network latency are guaranteed by construction, not promised by contract. For anyone working with sensitive data, or simply trying to understand how these systems work from the inside, this is a good moment to start.
Frequently asked questions
What are the minimum requirements to run Ollama?
Ollama runs on macOS, Linux, and Windows. On Linux, a 64-bit processor and at least 8 GB of RAM are recommended for 7B models. An NVIDIA or AMD GPU significantly speeds up inference.
Can I run Ollama without a GPU?
Yes. Ollama can run models on CPU only, though generation speed is much slower. For practical CPU-only use, 4-bit quantized models like llama3.2:3b offer the best balance.
How do I update Ollama to the latest version?
On Linux, run the official script again: curl -fsSL https://ollama.com/install.sh | sh. It detects the existing installation and updates it without removing downloaded models.
Where are Ollama downloaded models stored?
On Linux, models are stored in ~/.ollama/models. You can change the location with the OLLAMA_MODELS environment variable before starting the service.
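A minimal sketch for relocating the store when you launch the server by hand; /mnt/models is a hypothetical path on a larger disk:

export OLLAMA_MODELS=/mnt/models   # hypothetical target directory
ollama serve

If the systemd service manages Ollama, set the variable in a unit override (systemctl edit ollama) instead of exporting it in a shell.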