
How to Install Ollama to Run LLMs on Your Computer

Updated: 2026-05-03

Ollama[1] is the least painful way to run a large language model on your own computer. It essentially wraps llama.cpp[2] in a docker run-style UX: one binary, one command, and a quantised model downloaded to disk. Until recently, setting this up by hand meant juggling CPU feature flags, hunting for leaked weights, and reconciling file formats that shifted every couple of weeks.

Key takeaways

  • Ollama is available for macOS (Apple Silicon and x86), Linux, and Windows (via Docker or WSL2).
  • Llama 2 was the first model with a clear commercial licence; previously the weights were leaked material with murky legal status.
  • On macOS with 16 GB unified memory, a 7B model runs fluidly without a dedicated GPU.
  • The REST API exposes an OpenAI-compatible endpoint: swapping api.openai.com for localhost:11434 is trivial.
  • It doesn’t replace frontier models for complex reasoning, non-trivial code, or maths; it is competent for summarisation, rewriting, RAG, and offline chat.

Why local inference became practical

In early 2023 the original LLaMA weights leaked, and within days the community showed that a 7B model could run on a laptop with 4-bit quantisation. llama.cpp was born out of that. But the legal status of those weights was murky, and every tutorial started with “first get the torrent.”

On 18 July 2023 Meta released Llama 2 under a licence allowing commercial use, and the question shifted from “can I download this?” to “how do I run it well?” Ollama arrived precisely when there were clean weights, a stabilising file format, and enough kernel-level optimisation to make a 16 GB M2 a viable inference platform.

There’s also an economic motivation: the OpenAI API bill starts to sting when prototyping. A script evaluating 10,000 prompts against GPT-3.5 costs real money; against a local Llama 2 7B it costs electricity.
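A back-of-the-envelope calculation makes the point. The per-token price and per-call token count below are assumptions for illustration (roughly GPT-3.5-era pricing), not quoted rates:

```python
# Rough cost of a 10,000-prompt evaluation run against a paid API.
# Assumed numbers: ~$0.002 per 1K tokens and ~1,000 tokens per
# prompt + completion. Adjust both to current rates before relying on this.
PRICE_PER_1K_TOKENS = 0.002   # USD, assumed
TOKENS_PER_CALL = 1_000       # prompt + completion, assumed
N_PROMPTS = 10_000

api_cost = N_PROMPTS * TOKENS_PER_CALL / 1_000 * PRICE_PER_1K_TOKENS
print(f"API cost: ~${api_cost:.0f}")
```

Tens of dollars per experiment adds up quickly when you iterate daily; the local run amortises to hardware you already own.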

What you can and can’t do, honestly

With available open models you cannot replace GPT-4. The gap in complex reasoning, long-instruction following, and non-trivial code is substantial. You will not solve competition mathematics or build reliable tool-using agents with multi-step loops.

What does work reasonably well:

  • Summarising a document that fits in context.
  • Rewriting and translating text.
  • Generating boilerplate code.
  • Answering simple factual questions.
  • Acting as an offline chat assistant.
  • Feeding RAG pipelines where retriever quality matters more than generator quality.

Mistral 7B and Llama 2 13B are surprisingly competent at these tasks, without sending a single byte to someone else’s server.
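The RAG case largely reduces to prompt assembly: retrieve relevant passages, stuff them into the context, and let the local model answer over them. A minimal sketch of that assembly step (the passages, question, and instruction wording are placeholders, not a fixed recipe):

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt: retrieved passages first, question last."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the passages below. Cite passage numbers.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What port does the daemon listen on?",
    ["The service starts a daemon at localhost:11434."],
)
```

The resulting string is what you feed to the generator, whether via `ollama run` or the REST API; retriever quality determines whether the passages are worth citing at all.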

Installation on macOS, Linux, and Windows

macOS with Apple Silicon is where Ollama shines most: unified memory lets you load 13B models without a dedicated card and without paging. Install with a one-line script or by downloading the .dmg that leaves a menu-bar icon.

Linux: the same script detects the distribution (Ubuntu, Debian, Fedora, Arch), downloads the binary, creates an ollama system user, and starts a systemd service. If an NVIDIA GPU with drivers and CUDA is already present, it’s detected and used automatically.

Windows: still no native installer. The clean path is WSL2 (Ubuntu inside Windows with access to the host’s NVIDIA GPU); the frictionless alternative is the official Docker image exposing port 11434.
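For the Docker route, the invocation looks roughly like this (image name and volume layout follow the Ollama project's published image; treat the exact flags as a sketch to adapt):

```shell
# Run the Ollama server in a container, persisting models in a named
# volume and exposing the API on the host's port 11434.
docker run -d \
  --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# With the NVIDIA Container Toolkit installed, add --gpus=all to the
# run command above. Then start a model inside the container:
docker exec -it ollama ollama run llama2
```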

```bash
# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama2
```

The first run downloads the quantised weights (around 4 GB for Llama 2 7B in 4-bit), caches them under ~/.ollama/models, and opens an interactive chat. Switching models is as simple as ollama run mistral or ollama run llama2:13b. List what you have with ollama list; delete with ollama rm.

The service starts a daemon at localhost:11434 with its own REST API and an OpenAI-compatible endpoint, letting you point the Python openai library or LangChain at the local server by changing only the base URL.
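To make that concrete, here is a sketch of a chat request aimed at the local daemon; only the base URL differs from a call to the hosted API. The `/v1/chat/completions` path follows Ollama's OpenAI-compatible interface; the actual send is commented out so nothing here requires a running server:

```python
import json
import urllib.request

def build_chat_request(model: str, user_message: str):
    """Build an OpenAI-style chat request aimed at the local Ollama daemon."""
    url = "http://localhost:11434/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, json.dumps(payload).encode()

url, body = build_chat_request("llama2", "Summarise this paragraph: ...")

# With the daemon running, send it like any other HTTP POST:
# req = urllib.request.Request(
#     url, data=body, headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Client libraries that accept a base URL (the Python openai package, LangChain) need only that one setting changed to target the local server.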

[Figure: hardware comparison for local LLM inference, showing RAM requirements by model size and expected speed]

Models worth trying

The official library hosts several dozen variants. Reasonable picks:

  • llama2 (7B, 3.8 GB): the workhorse, runs on any machine with 8 GB.
  • llama2:13b: if you have 16 GB RAM or more and want better coherence.
  • Instruction-tuned derivatives like WizardLM variants: improve instruction following over the original base.
  • llama2:70b exists but needs at least 48 GB of RAM; more a curiosity than a practical tool.

Don’t chase every release: the ecosystem ships variants weekly and most are marginal iterations over the same base models.

Hardware, without the mythology

As a rough guide:

  • 8 GB RAM: a quantised 7B runs, but slowly; the machine is under memory pressure.
  • 16 GB: a 7B is fluent and a 13B is usable.
  • 32 GB: comfortable territory for 13B and experimenting with 34B.
  • 64 GB or GPU with lots of VRAM: needed for 70B.

An NVIDIA card with 8 GB or more accelerates inference by a factor of 5-10 over pure CPU. On Mac, all RAM counts as effective VRAM, which is why a 32 GB MacBook Pro is currently one of the best inference machines per euro spent.
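The RAM figures above follow from simple arithmetic: a model's weights occupy roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime buffers. A sketch of that estimate, where the 1.2 overhead factor is a rough assumption rather than a measured value:

```python
def model_memory_gib(n_params: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough resident-memory estimate for a quantised model.

    overhead=1.2 is an assumed fudge factor covering the KV cache
    and runtime buffers; real usage varies with context length.
    """
    bytes_needed = n_params * bits_per_weight / 8 * overhead
    return bytes_needed / 2**30

for label, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{label:>3} @ 4-bit: ~{model_memory_gib(n, 4):.1f} GiB")
```

Under these assumptions a 4-bit 7B lands near 4 GiB, a 13B near 7 GiB, and a 70B near 40 GiB, which is consistent with the tiers listed above.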

Where it goes next

Ollama is the first rung. For serious use, combine it with:

  • A UI like Open WebUI[3] for chat with history.
  • An editor plugin like Continue.dev[4] for Copilot-style autocomplete.
  • A RAG stack on LangChain for querying your own documents.

OpenAI API compatibility means swapping api.openai.com for localhost:11434 in existing applications is almost trivial — which changes the economics of every prototype.

Also see generative AI regulation for the compliance context that accompanies proprietary models, and OpenAI’s code-interpreter as a comparison point with cloud models.

Conclusion

Local inference has gone from academic exercise to legitimate engineering option. It doesn’t replace frontier models, but it opens a parallel lane where privacy, zero marginal cost, and zero network latency are guaranteed by construction, not promised by contract. For anyone working with sensitive data, or simply trying to understand how these systems work from the inside, this is a good moment to start.

Frequently asked questions

What are the minimum requirements to run Ollama?

Ollama runs on macOS, Linux, and Windows. On Linux, a 64-bit processor and at least 8 GB of RAM are recommended for 7B models. An NVIDIA or AMD GPU significantly speeds up inference.

Can I run Ollama without a GPU?

Yes. Ollama can run models on CPU only, though generation speed is much slower. For practical CPU-only use, 4-bit quantized models like llama3.2:3b offer the best balance.

How do I update Ollama to the latest version?

On Linux, run the official script again:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

It detects the existing installation and updates it without removing downloaded models.

Where are Ollama downloaded models stored?

On Linux, models are stored in ~/.ollama/models. You can change the location with the OLLAMA_MODELS environment variable before starting the service.
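For example (the path below is a placeholder; on a systemd install the variable must be set in the service's environment, e.g. via `systemctl edit ollama`, not just in your shell):

```shell
# Point the model store at a different location (example path).
export OLLAMA_MODELS=/tmp/ollama-models
mkdir -p "$OLLAMA_MODELS"

# The daemon reads the variable at startup:
# ollama serve
```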

References

  1. Ollama
  2. llama.cpp
  3. Open WebUI
  4. Continue.dev

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.