
oMLX: serve LLMs locally on Apple Silicon Macs without fighting Metal

Key takeaways

  • oMLX is an LLM inference server built on MLX (Apple’s framework for Apple Silicon) that adds what mlx-lm alone is missing for serious serving: continuous batching, tiered RAM+SSD KV cache with prefix sharing, multi-model with LRU eviction, and an OpenAI- and Anthropic-compatible API.
  • Current release is 0.3.8 (30 Apr 2026), Apache 2.0, 14.3k GitHub stars and 71 releases. Young project with a high cadence, not yet v1.0.
  • On an M1–M4 Mac running macOS 15+, it comfortably replaces Ollama as soon as you need concurrency or want to hit an OpenAI-compatible endpoint from your own app.
  • When NOT to pick it: if you need multi-OS portability (Linux, NVIDIA), enterprise features (built-in auth, OTEL, multi-tenancy), or you live on legacy GGUF, Ollama still wins.

Why Apple Silicon needs its own runtime

Unified memory (CPU, GPU and Neural Engine sharing the same DRAM) is the material difference between an M-series Mac and a box with an NVIDIA GPU. On an RTX 4090 you have 24 GB of separate VRAM and tensors travel over PCIe; on a 64 GB M4 Max, the model and the KV cache sit on the same bus as the CPU. The problem stops being “how many GB fit in VRAM” and becomes “how do I make use of unified memory and the Apple Matrix coprocessors without writing Metal kernels by hand”.

CUDA does not apply. ROCm does not either. Apple shipped MLX in December 2023 as the official answer: a NumPy/PyTorch-style array framework that compiles down to Metal and is built around unified memory. On top of MLX, the (Apple-owned) mlx-lm team maintains LLM weights converted to the MLX format (Qwen, Llama, Mistral, GLM, DeepSeek) with 4-bit and 8-bit quantizations tuned for the M-series.
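
To get a feel for the unified-memory model from Python, here is a minimal MLX sketch (the matrix sizes are arbitrary and nothing in it is oMLX-specific): arrays are allocated once, visible to CPU and GPU alike, and computation stays lazy until you ask for a result.

python
import mlx.core as mx

# One allocation in unified memory; no .to("cuda") / .to("cpu") copies needed.
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# Lazy by default: this only builds a compute graph, nothing runs yet.
c = (a @ b).sum()

# Materialize the result; MLX compiles and runs the Metal kernels here.
mx.eval(c)
print(c.item())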

mlx-lm on its own is a library. One model, one request at a time, no real batching. For anything beyond chatting with yourself, the server layer is missing. That is where oMLX fits.
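
To make the point concrete, raw mlx-lm usage looks roughly like the sketch below (the repo name is an example mlx-community conversion, and for chat models you would normally apply the tokenizer's chat template first): load one model, block on one generation, done.

python
from mlx_lm import load, generate

# One model in memory, one blocking generation. No server, no queue,
# no second request until this call returns.
model, tokenizer = load("mlx-community/Qwen3-14B-Instruct-4bit")  # example repo

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one paragraph.",
    max_tokens=200,
)
print(text)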

What oMLX adds on top of mlx-lm

oMLX describes itself as “LLM inference, optimized for your Mac — continuous batching and tiered KV caching, managed directly from your menu bar”. The pieces that matter:

  • Continuous batching. When a client requests tokens, oMLX does not wait for that request to finish before serving the next one; it interleaves requests at the token level. If three users chat with the same model at once, all three advance in parallel instead of queueing. Without batching, an M4 Max feeding two terminals makes them take turns like a single phone line (see the sketch after this list).
  • Tiered KV cache (hot RAM + cold SSD) with prefix sharing. The KV cache is memory of tokens already processed. oMLX keeps recent ones in RAM and parks the colder ones compressed on SSD, sharing prefixes across requests that start the same way (the classic “You are a helpful assistant…”). In practice you can keep long context windows without blowing through available RAM.
  • Multi-model with LRU eviction. Load Qwen3, Llama 3.3 and an OCR model at the same time, and oMLX decides which to evict when memory tightens. You can also unload manually from the dashboard if you want more control.
  • OpenAI- and Anthropic-compatible API. Point an OpenAI SDK client at http://localhost:8000/v1 and the integration works without rewriting. Same for tool calling and structured output.
  • Menu bar app and admin dashboard. Native macOS app in the top bar, plus a web dashboard with chat, model downloads and a benchmarking tool at :8000/admin/chat. You don’t need to live in the terminal.
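
To see what continuous batching changes in practice, the sketch below fires three chats at the local endpoint at once through the standard OpenAI SDK. Against a backend that serializes per model they finish one after another; against oMLX they advance together, and the shared system prompt is exactly the kind of prefix the cache can reuse. The model id is an example, use whatever you have loaded.

python
import asyncio
from openai import AsyncOpenAI

# Local OpenAI-compatible endpoint; the key is a dummy, since the server ships without auth.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SYSTEM = "You are a helpful assistant."  # identical prefix across all three requests

async def ask(question: str) -> str:
    resp = await client.chat.completions.create(
        model="mlx-community/Qwen3-14B-Instruct-4bit",  # example model id
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        max_tokens=200,
    )
    return resp.choices[0].message.content

async def main() -> None:
    answers = await asyncio.gather(
        ask("Summarize unified memory in two sentences."),
        ask("Give three uses for a local LLM."),
        ask("What is a KV cache?"),
    )
    for answer in answers:
        print(answer, "\n---")

asyncio.run(main())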

VLM (Qwen3.5-VL, GLM-4V, Pixtral), OCR (DeepSeek-OCR, DOTS-OCR), embeddings (BGE-M3, ModernBERT) and rerankers all run in the same instance. For a RAG workflow where generation, embeddings and rerank share a process, that single endpoint is worth a lot.
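
As a sketch of what sharing one endpoint across the RAG loop looks like, the snippet below embeds a few snippets, picks the closest one and feeds it back to the chat model, all through the same client. It assumes oMLX exposes the standard /v1/embeddings route and that an embedding model such as BGE-M3 is loaded; check the model ids your instance actually reports before copying this.

python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

docs = [
    "oMLX keeps the hot KV cache in RAM and spills cold entries to SSD.",
    "MLX is Apple's array framework built around unified memory.",
    "Continuous batching interleaves requests at the token level.",
]

def embed(texts: list[str]) -> np.ndarray:
    # Assumes the standard OpenAI embeddings route; "bge-m3" is an example id.
    resp = client.embeddings.create(model="bge-m3", input=texts)
    return np.array([item.embedding for item in resp.data])

question = "Where does oMLX put the cold part of the KV cache?"
doc_vecs = embed(docs)
q_vec = embed([question])[0]

# Cosine similarity to pick the best-matching snippet.
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
context = docs[int(scores.argmax())]

answer = client.chat.completions.create(
    model="mlx-community/Qwen3-14B-Instruct-4bit",  # example model id
    messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)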

What it looks like running

Install via Homebrew (there is also a .dmg release and pip install -e . from source):

bash
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx
omlx serve --model-dir ~/models

Once it is up, download models from the dashboard at http://localhost:8000/admin/chat and point any OpenAI-compatible SDK at http://localhost:8000/v1. The menu bar shows loaded models, active requests, and RAM and SSD usage.
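
A quick smoke test, assuming the standard OpenAI-compatible routes for listing models and chat completions; it streams a short reply from whichever model the server reports first.

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# List whatever the server currently has available.
models = [m.id for m in client.models.list().data]
print("Models:", models)

# Stream a short completion from the first one.
stream = client.chat.completions.create(
    model=models[0],
    messages=[{"role": "user", "content": "Say hello in five words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()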

Pragmatic comparison (May 2026)

Four axes that actually matter on a Mac:

Install and models. Ollama is the most comfortable (curl | sh, its own library, GGUF). LM Studio is the “I do not want a terminal” option, with a full GUI. mlx-lm is plain pip install and models pulled directly from Hugging Face in MLX format. oMLX installs via Homebrew or .dmg, downloads MLX models from the dashboard, and as of v0.3.x can import any MLX repo from Hugging Face by pasting a URL.

Real batching under concurrency. Here Ollama’s llama-server falls short: it serializes requests per model. LM Studio is the same. mlx-lm, one at a time. oMLX is the only one of the four that does vLLM-style continuous batching. If two people, an agent, and an editor will be talking to the same model simultaneously, oMLX changes the feel.

Model format. Ollama lives on GGUF (llama.cpp). MLX and oMLX live on the MLX format, which quantizes specifically for Metal and makes better use of the Apple Matrix coprocessors on M3+. For the same model, MLX quantized to 4-bit typically pushes 15-30% more tokens/s on M3/M4 than the equivalent GGUF on Ollama, based on benchmarks circulating in mlx-community and Hacker News threads through 2025-2026. The gap is measurable, not a sales line.

API and tooling. Ollama has its own API plus partial OpenAI compatibility. LM Studio also exposes an OpenAI endpoint. mlx-lm is not a server. oMLX ships OpenAI- and Anthropic-compatible APIs, tool calling, structured output, and built-in benchmarking.
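
Tool calling rides on the same OpenAI-style schema. A minimal sketch follows; the get_weather function is made up for the example, the model id is an assumption, and how reliably a given local model emits tool calls depends on the model itself.

python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# One declared tool; the model decides whether to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for the example
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mlx-community/Qwen3-32B-Instruct-4bit",  # example model id
    messages=[{"role": "user", "content": "What's the weather in Madrid right now?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(msg.content)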

Verdict by scenario:

  • Daily Mac, one user, one chat: LM Studio or Ollama. You will not notice batching.
  • Demo to a client: LM Studio (the GUI sells itself).
  • OpenAI-style API for your own app or an agent you maintain: oMLX. Less friction and fewer surprises with tool calling.
  • Evaluation under concurrency, or RAG with generation, embeddings and rerank on the same host: oMLX, no contest.
  • Same workflow but you want to move it to Linux tomorrow: Ollama. Portability wins.

Before the numbers: oMLX shares unified RAM with macOS and everything else you have open. A useful rule of thumb is to subtract 10 to 14 GB from the total for Finder, Safari, the IDE and the rest before you start counting space for models and KV cache. Every size below assumes MLX 4-bit quantizations from mlx-community on Hugging Face[1], a single user and 4k context. Going from 4k to 32k can add 4 to 15 GB of KV cache depending on the model. oMLX’s cold SSD tier helps, but you still want headroom in hot RAM.
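
Those rules of thumb turn into trivial arithmetic. The helper below uses the article's own numbers (a 10 to 14 GB OS allowance, a few GB of KV cache at 4k context, more as you stretch it); it is a back-of-the-envelope check, not anything oMLX reports.

python
def fits(total_gb: float, model_gb: float, kv_cache_gb: float,
         os_overhead_gb: float = 12.0) -> bool:
    """Back-of-the-envelope: do weights plus KV cache fit in the hot-RAM budget?"""
    usable = total_gb - os_overhead_gb
    needed = model_gb + kv_cache_gb
    print(f"usable ~{usable:.0f} GB, needed ~{needed:.0f} GB")
    return needed <= usable

# 64 GB Mac, Llama-3.3-70B at 4-bit (~40 GB), 4k context (a few GB of cache):
fits(64, 40, 3)    # comfortable
# Same model pushed to 32k context, where the cache can grow past 10 GB:
fits(64, 40, 12)   # fits, but leaves little room for a second model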

Chips at each tier:

  • 24 GB: M4 base (mid config) and entry-level M4 Pro. When the M5 family ships, the base M5 is expected to land here.
  • 32 GB: M4 base (top option) or base M5. Few Pro configurations sit here.
  • 64 GB: M4 Max (mid) and, expected through 2026, top M5 Pro and base M5 Max.
  • 128 GB: top M4 Max and high M5 Max. The expected M5 Ultra should open up 192 and 256 GB.

24 GB — about 12-14 GB usable

The well-configured M4 Mac Mini and the entry-level M4 Pro MacBook Pro. You pick either a strong model or a multi-model setup, not both.

  • Quality: Mistral-Small-3.2-24B-Instruct-4bit (~13 GB). A tight fit that leaves little room for long context.
  • More comfortable: Qwen3-14B-Instruct-4bit (~8 GB). Good quality-to-RAM balance, with room for BGE-M3 (~1.2 GB) and a small reranker.
  • Pure speed: Llama-3.2-3B-Instruct-4bit or Gemma-3-4B-4bit (~2-3 GB), 80-120 tok/s on M4 Pro.
  • Viable VLM: Qwen2.5-VL-7B-4bit (~5 GB) for occasional OCR.

32 GB — about 18-22 GB usable

Enough for a serious chat model plus embeddings and a reranker in parallel.

  • Quality: Qwen3-32B-Instruct-4bit (~17 GB). The sweet spot for this tier: it punches close to a 70B on many tasks and leaves context headroom.
  • Speed: drop to Qwen3-14B-Instruct-4bit to hit 40-60 tok/s on M4 Pro.
  • Multi-model: 14B chat + Pixtral-12B-4bit (~7 GB) for vision + embed + reranker all loaded, with oMLX’s LRU eviction moving the cold ones to SSD as needed.

64 GB — about 50 GB usable

The tier where the Mac starts to look like a mid-range GPU box, without the fan noise that comes with one.

  • Quality: Llama-3.3-70B-Instruct-4bit (~40 GB), the reference for general writing and reasoning. Mistral-Large-2-123B only fits at 3-bit quantization and gets tight.
  • Speed without sacrificing much quality: Qwen3-30B-A3B-Instruct-4bit (MoE; takes ~17 GB but only activates ~3B parameters per token), 50-60 tok/s on M4 Max. The model that changes how this tier feels in daily use.
  • Realistic multi-model: 70B chat + Qwen2.5-VL-32B-4bit (~18 GB) + embeddings + reranker, with LRU moving things around when memory tightens.

128 GB — about 110 GB usable

This is where things fit that don’t fit on any reasonable consumer NVIDIA hardware. The “I bought a Mac instead of building a server” argument starts to stand on its own here.

  • Top quality: Mistral-Large-2-123B-Instruct-4bit (~70 GB) is the clear pick for general reasoning. DeepSeek-V3-MoE in aggressive quantization fits (~80 GB). Qwen3-235B-A22B-4bit pushes 140 GB; 3-bit is more realistic at the cost of some quality.
  • Speed: same MoE pattern (30B-A3B or equivalent) if you want low-latency agent loops.
  • Serious multi-model: 123B chat + a 70B alternative under LRU + Qwen2.5-VL-72B-4bit (~40 GB) for vision. Three large models loaded simultaneously with SSD cache for the cold ones is realistic on this tier.

Expected throughput (M4 Max, single user, 4k context)

Order-of-magnitude reference for dense 4-bit models: 14B around 50-70 tok/s, 32B around 25-35, 70B around 10-15, 123B around 6-9. MoE architectures (30B-A3B, 235B-A22B) tend to run like a dense model of their active size (~3B and ~22B respectively), so a 30B-A3B sustains 50-60 tok/s. On M4 Pro, trim those numbers by 30 to 40% on memory bandwidth alone; on M4 base, nearly half. M5 is expected to land 10 to 25% faster at the same tier, and the gain is more visible on prefill than on decode.
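
If you would rather measure than trust these figures, throughput is easy to approximate by timing a streamed completion and counting chunks as a proxy for tokens; it folds prefill into the measurement, which is fine for an order-of-magnitude check.

python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="mlx-community/Llama-3.3-70B-Instruct-4bit",  # example model id
    messages=[{"role": "user", "content": "Write 300 words about unified memory."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1

elapsed = time.time() - start
# Chunk count approximates token count; prefill time is included in elapsed.
print(f"~{chunks / elapsed:.1f} tok/s over {elapsed:.1f} s")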

What oMLX is not yet

This is a young project. v0.3.8 is not v1.0. The instance ships with no out-of-the-box auth (you leave it listening on localhost or front it with a reverse proxy and basic auth), no native OTEL integration for metrics and traces, and it is not designed for multi-tenancy. If you come from enterprise with compliance on top, those layers are on you.

Another point: GGUF. If your current collection is GGUF files downloaded over the last year, oMLX cannot eat them. Either you convert to MLX (there are upstream scripts for popular models) or you stick with Ollama. Conversion is not trivial for exotic architectures or custom tokenizers, and it is worth checking the mlx-community repo on Hugging Face before converting anything by hand.
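
For reference, the upstream converter lives in mlx-lm itself and works from the original Hugging Face weights (safetensors), not from a GGUF file; a minimal sketch, with option names worth double-checking against the mlx-lm docs for your version:

python
from mlx_lm import convert

# Pull the original HF weights and write a 4-bit MLX conversion locally.
convert(
    "mistralai/Mistral-7B-Instruct-v0.3",       # example source repo
    mlx_path="./mistral-7b-instruct-mlx-4bit",  # output directory
    quantize=True,                              # group quantization, 4-bit by default
)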

And, obviously, Mac. There is no Linux or Windows build. If your dev laptop is a Mac and your home server is a Linux box with NVIDIA, the oMLX binary will not serve the second one. You will end up running two different runtimes for the two worlds.

Where this fits for jacar.es readers

For a founder or CTO with an M3/M4 Mac wondering whether to try local AI before committing to dedicated infrastructure, the answer is yes, and this is where to start. It saves you setting up Docker, configuring Open WebUI and exposing it behind Traefik, which are the steps the Ollama + Llama 3.3 on Ubuntu tutorial covers once you have decided it is worth it. oMLX is step zero: check on your own laptop whether the open models are good enough before putting a GPU in production.

If it fits, the next step is wiring it to a real client: see how the Anthropic SDK agent tutorial reads when pointed at the local endpoint, or how it sits inside an MCP multi-vendor stack without paying for tokens every time.
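
For the Anthropic side, the official SDK takes a base_url override, so pointing it at the local instance is a one-line change. The sketch below assumes oMLX mounts the Anthropic-compatible route at the server root; confirm the exact path against the oMLX docs before relying on it.

python
from anthropic import Anthropic

# base_url is an assumption here; the SDK appends /v1/messages to it.
client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

msg = client.messages.create(
    model="mlx-community/Qwen3-32B-Instruct-4bit",  # example model id
    max_tokens=300,
    messages=[{"role": "user", "content": "Draft a three-step plan for evaluating local models."}],
)
print(msg.content[0].text)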

Reference repos: github.com/jundot/omlx[2], github.com/ml-explore/mlx[3], github.com/ml-explore/mlx-lm[4], huggingface.co/mlx-community[1].

  1. huggingface.co/mlx-community
  2. github.com/jundot/omlx
  3. github.com/ml-explore/mlx
  4. github.com/ml-explore/mlx-lm

Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.