How to Install Ollama on macOS with Apple Silicon

MacBook with an illuminated screen on a minimalist desk, dark theme

Ollama is the most direct way to run large language models on an Apple Silicon Mac. A single command is enough to have Llama 3.1 8B or Mistral 7B answering requests directly on your laptop, with no accounts, no API keys, and not a single word of your conversations leaving the machine. This guide covers installation from scratch, model choice by RAM, and integrating the service with the applications you already use.

Why Ollama Works So Well on a Mac

The advantage of Apple Silicon over a traditional PC is not anecdotal; it is architectural. M1, M2 and M3 chips share memory between CPU and GPU instead of keeping it separate the way a machine with a discrete graphics card does. That unified memory means an eight-billion-parameter Llama 3.1 does not need to be copied across the PCIe bus before the GPU can process it: the same bytes are visible to both, and inference avoids the transfer toll. On top of that sits Metal, Apple's GPU API, for which llama.cpp (the engine Ollama uses underneath) has shipped a well-optimized backend since 2023.

The practical upshot is that a silent, fanless MacBook Air M2 can serve a seven or eight-billion-parameter model at speeds perfectly usable for real work, while power draw barely exceeds that of a browser with a handful of tabs open. The memory bandwidth of these chips (roughly 70-100 GB/s on the base M1 and M2, up to 800 GB/s on an M2 Ultra) is precisely the bottleneck that dominates inference for a quantized LLM.
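The bandwidth-bound claim lends itself to napkin math: during decoding, each generated token requires streaming roughly the whole quantized model through memory once, so bandwidth divided by model size gives an upper bound on tokens per second. A small sketch; the bandwidth and size figures are illustrative assumptions, not benchmarks:

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model:
    every token needs (roughly) one full pass over the model's weights."""
    return bandwidth_gb_s / model_size_gb

# Llama 3.1 8B at Q4_K_M occupies about 4.7 GB.
print(round(max_tokens_per_second(100, 4.7)))  # base M2, ~100 GB/s
print(round(max_tokens_per_second(800, 4.7)))  # M2 Ultra, ~800 GB/s
```

The real numbers land below this ceiling (attention compute, KV-cache traffic and quantization overhead all eat into it), but the bound explains why bandwidth, not raw GPU power, is the figure to watch on a spec sheet.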

Quick Comparison with LM Studio

The most common alternative on the Mac is LM Studio, a desktop application with a full graphical interface, built-in model browsing and a visual chat. The difference is not about engine quality (both use llama.cpp) but philosophy: LM Studio targets people who want a local ChatGPT-style experience without touching a terminal; Ollama is designed as a background service that exposes an API and has other tools connect to it. If you plan to wire a model into VS Code, scripts and a note-taking assistant at once, Ollama wins because they all speak to the same endpoint on localhost:11434 with no duplicated configuration.

Installation

Two paths end in the same background service. The graphical installer is downloaded from the official site, dragged into Applications, and on first run asks permission to launch as a background service. The clean alternative for technical users is Homebrew: brew install ollama followed by brew services start ollama leaves the daemon listening. ollama --version confirms the install and curl http://localhost:11434 should return “Ollama is running”. From there, the daemon listens on port 11434 and any ollama command talks to it.
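If you script against the daemon, the same check the curl command performs can be done programmatically; a minimal sketch using only the standard library, assuming the default port:

```python
import urllib.request
import urllib.error

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama daemon answers on its default port."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            # The root endpoint replies with the plain text "Ollama is running".
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(ollama_is_running())
```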

Picking a Model by RAM

The rule is simple: a 4-bit quantized model takes up roughly half as many gigabytes as it has billions of parameters, and you need 2-4 GB free for the OS and whatever application you are using. With that arithmetic, an 8 GB MacBook Air M1 or M2 is comfortable with Phi-3 mini (2.3 GB) or Gemma 2B, useful for rewriting, summarization or translation, but insufficient for complex reasoning. The 16 GB of a MacBook Pro M2 Pro or an Air M3 opens the door to the best general-purpose compromise available right now: Llama 3.1 8B instruct in Q4_K_M quantization, which takes up around 4.7 GB and leaves headroom for the rest of your workflow. Mistral 7B instruct and Code Llama 7B play in the same league.

With 32 GB, typical of a MacBook Pro Max, Mixtral 8x7B (26 GB) comes into play and is a strong option for multilingual work. Llama 3.1 70B quantized weighs around 40 GB, so it needs 48 GB or more to load comfortably; it offers quality close to closed frontier models, though with noticeably higher latency. With 64 GB or more (top-spec M3 Max, Mac Studio) the 70B runs with headroom, while the quantized 405B is only realistic on an M2 Ultra with 192 GB, at a speed better suited to curiosity than production.
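That arithmetic is easy to encode. A rough sizing helper; the per-billion factor is an approximation back-derived from Llama 3.1 8B's ~4.7 GB Q4_K_M file, and the headroom default is the guide's 2-4 GB rule:

```python
def q4_footprint_gb(params_billions: float) -> float:
    """Approximate in-memory size of a 4-bit (Q4_K_M) quantized model.
    ~0.59 GB per billion parameters, derived from Llama 3.1 8B ~= 4.7 GB."""
    return params_billions * 0.59

def fits_in_ram(params_billions: float, ram_gb: int, headroom_gb: float = 3.0) -> bool:
    """Does the model fit alongside the OS and your other applications?"""
    return q4_footprint_gb(params_billions) + headroom_gb <= ram_gb

print(fits_in_ram(8, 16))   # Llama 3.1 8B on a 16 GB machine
print(fits_in_ram(70, 32))  # Llama 3.1 70B on a 32 GB machine
```

Treat the result as a first filter, not a guarantee: context length (num_ctx) adds KV-cache memory on top of the weights.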

Interactive Use and the Local API

Everyday use takes two forms. Conversational mode is invoked with ollama run llama3.1:8b and opens a >>> prompt, with internal commands like /set parameter temperature 0.7, /set parameter num_ctx 8192 to widen the context window, /show info and /bye to exit. ollama pull, ollama list and ollama rm manage downloaded models under ~/.ollama/models/.

The more interesting part, however, is the HTTP API that Ollama exposes on localhost:11434. There are two endpoints: its own (/api/generate, /api/chat) and one compatible with OpenAI’s format at /v1/chat/completions. That second endpoint is what makes any GPT-oriented client work by simply pointing it at the Mac, with no code changes. A Python example makes it concrete:

from openai import OpenAI

# Point the official OpenAI client at the local daemon; the key is ignored
# by Ollama, but the library requires a non-empty string.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # any model already pulled with `ollama pull`
    messages=[{"role": "user", "content": "Explain RAG in three sentences."}],
)
print(response.choices[0].message.content)

The same pattern works from Node, from a VS Code extension, or from any utility you previously pointed at OpenAI.
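The native /api/chat endpoint takes a similar but not identical body. A standard-library sketch; the actual HTTP call is commented out so the snippet stands without a live daemon:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST for Ollama's native /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON response instead of a stream of chunks
    }
    return urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("llama3.1:8b", "Explain RAG in three sentences.")
# With the daemon running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["message"]["content"])
```

By default the native endpoint streams newline-delimited JSON chunks; setting "stream": False is what collapses the answer into the single object read above.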

Modelfiles, Integrations and Operation

When you end up repeating the same system prompt across several projects, it pays to pin it down in a Modelfile: a text file that starts from a base model with FROM llama3.1:8b, adds parameters with PARAMETER temperature 0.7, and sets the system prompt with a SYSTEM "You are a technical assistant, concise and direct." line. You build it with ollama create my-assistant -f Modelfile and invoke it like any other model. It is the local equivalent of a “custom GPT”.
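Put together, a minimal Modelfile matching that description looks like this (the assistant name and prompt are the illustrative ones from the paragraph above):

```
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a technical assistant, concise and direct."
```

Save it as Modelfile, run ollama create my-assistant -f Modelfile, and ollama run my-assistant behaves like the base model with those settings baked in.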

On the ecosystem side, it is worth knowing about OpenWebUI (a ChatGPT-style web interface that talks to your local Ollama), Continue for code assistance in VS Code, Aider in the terminal, Raycast for quick queries from the menu bar, and the Copilot plugin for Obsidian to reason over your notes. They all speak to the same service.

In operation, Ollama consumes roughly 100 MB of RAM at idle, listens only on localhost by default (exposable to the LAN with OLLAMA_HOST=0.0.0.0) and stops with brew services stop ollama. Typical throughput: an M1 base with Phi-3 runs around 30 tokens per second, an M2 Pro with Llama 3.1 8B sits between 40 and 50, and an M3 Max with the quantized 70B drops to roughly 15, still comfortably conversational.

Conclusion

That a laptop without a discrete graphics card can run models equivalent to last year’s GPT-3.5 with full privacy, no network connection and a familiar API is still one of the more surprising facts of 2024. Ollama is not the answer for serious production (that space belongs to vLLM or TGI, designed for concurrency and throughput), but it is the best entry point for development and personal use. For lawyers, doctors, journalists or any professional handling sensitive documents, the guarantee that nothing is shipped off to a third party stops being a promise and becomes a verifiable property of the system. For everyone else, it remains the cheapest way to take back control over which model your machine uses and which data you hand it.
