Mistral AI released Mixtral 8x22B on April 10, 2024, in their characteristic style: a magnet link on Twitter, with no prior blog post or press event. The community had the weights downloaded within hours, and benchmarks appeared the next day. It's the next generation of their MoE (Mixture of Experts) architecture, with 141B total parameters but only ~39B active per forward pass. This changes the economics of serving open models.
What Mixtral 8x22B Is
Sparse Mixture of Experts architecture:
- 8 “experts” of 22B parameters each.
- Router selecting 2 experts per token.
- Total: 141B parameters on disk.
- Active per forward pass: ~39B (2 experts + shared components).
Result: ~141B capacity with ~39B inferential cost. Better quality/cost ratio than a dense model of equivalent size.
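The routing idea can be sketched in a few lines of Python. This is a toy illustration with made-up logits; Mixtral's real router is a learned linear layer over the hidden states, not these hand-picked scores:

```python
import math

# Toy top-2 router: pick the 2 highest-scoring experts for a token
# and renormalise their softmax weights over just those two.
def top2_route(scores):
    """Return [(expert_index, weight), ...] for the top-2 experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# 8 experts, one token's (made-up) router logits
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.7]
print(top2_route(logits))  # experts 1 and 4 carry this token
```

Every token still passes through the shared components (embeddings, attention, router), which is why the active count is ~39B rather than exactly 2 × 22B.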
License and Distribution
Apache 2.0. No commercial-use restrictions. Weights are at:
- Hugging Face (base).
- Hugging Face (instruct).
- Original magnet links still work.
Compared to Llama 3 70B (more restrictive license) or Claude 3 (closed weights), Mixtral 8x22B is the most permissive large-scale option.
Key Benchmarks
Numbers from Mistral and community:
| Benchmark | Mixtral 8x22B | Llama 3 70B | GPT-4 | GPT-3.5 |
|---|---|---|---|---|
| MMLU | 77.8 | 79.5 | 86.4 | 70.0 |
| HellaSwag | 88.9 | 88.0 | 95.3 | 85.5 |
| GSM8K | 78.6 | 93.0 | 92.0 | 57.1 |
| HumanEval | 45.1 | 81.7 | 88.4 | 48.1 |
| Multilingual (FR, ES, IT, DE) | Excellent | Good | Excellent | Medium |
Key points:
- General quality close to Llama 3 70B, with a more inference-efficient architecture.
- Stronger multilingual than Llama 3 70B — especially Spanish, French, Italian, German.
- Behind Llama 3 70B on maths.
- Competitive at coding, but not top-tier.
For EU multilingual cases, Mixtral 8x22B is likely the best open option.
Required Hardware
This is the limiting factor:
| Precision | VRAM |
|---|---|
| FP16 | ~280 GB |
| INT8 | ~140 GB |
| INT4 (GGUF Q4_K_M) | ~80 GB |
| INT3 | ~60 GB |
Practical implications:
- Doesn't fit on a single consumer GPU: a 4090 (24GB) can't hold it even quantised.
- One A100 80GB or H100 80GB can serve quantised Q4.
- 2x A100 40GB distributed with tensor parallelism works.
- Apple Silicon M3 Max 128GB: fits Q4 and works at ~5-10 tokens/s.
For serious production you almost always need datacenter GPUs.
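The table's figures follow from simple arithmetic: parameter count times bits per weight. A quick estimator (Q4_K_M averages roughly 4.5 bits per weight; KV cache and activation memory come on top of these numbers):

```python
# Rough VRAM footprint of the weights alone: params x bits / 8 bytes.
# Real serving adds KV cache and activations on top of this.
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Gigabytes needed to hold params_b billion weights at a given precision."""
    return params_b * bits_per_weight / 8

for label, bits in [("FP16", 16), ("INT8", 8), ("Q4_K_M", 4.5)]:
    print(f"{label}: ~{weights_gb(141, bits):.0f} GB")
# FP16 ~282, INT8 ~141, Q4_K_M ~79 -- in line with the table above
```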
Comparison with Mixtral 8x7B
The younger sibling (46.7B total, 12.9B active):
| Aspect | 8x7B | 8x22B |
|---|---|---|
| Total parameters | 46.7B | 141B |
| Active/token | 12.9B | 39B |
| Q4 VRAM | ~25GB | ~80GB |
| General quality | ~GPT-3.5 | just below GPT-4 |
| Multilingual | Very good | Excellent |
| Tokens/s (A100 Q4) | ~60 | ~25 |
For many cases, 8x7B is more pragmatic: faster, cheaper, sufficient quality. 8x22B makes sense when quality matters more than throughput.
Production Serving
Typical stack:
```
# vLLM with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
```
For Q4 with llama.cpp:
```
./server -m mixtral-8x22b-instruct-Q4_K_M.gguf \
  -c 16384 -ngl 99 --host 0.0.0.0 --port 8080
```
vLLM gives the best GPU throughput; llama.cpp is more portable and handles mixed CPU-GPU offload.
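With either server up, a minimal client needs nothing beyond the standard library. This sketch targets the OpenAI-compatible endpoint that vLLM exposes; the host and vLLM's default port 8000 are assumptions, so adjust to your deployment:

```python
import json
import urllib.request

# OpenAI-compatible chat endpoint exposed by the vLLM server above.
# Host/port are assumptions (vLLM defaults to 8000); change as needed.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": "mistralai/Mixtral-8x22B-Instruct-v0.1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

def chat(prompt: str) -> str:
    """POST the prompt and return the first completion's text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Summarise this clause in French: ...")
```

Because both servers speak the same API shape, the same client works against either backend by changing the URL.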
Fine-Tuning
LoRA on Mixtral 8x22B is feasible:
- QLoRA: possible on 4x A100 80GB.
- Expert-specific adaptation (MoE-aware fine-tuning) is active research.
- DPO for alignment after domain fine-tune.
For most enterprise cases, prompt engineering plus RAG on top of the Mixtral instruct model covers the need without any fine-tune. Fine-tune only when prompting clearly falls short.
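To see why LoRA is feasible where full fine-tuning isn't, count the trainable parameters: two low-rank matrices per adapted projection. The dimensions below (6144 hidden size, 56 layers) are taken from Mixtral 8x22B's published config, but treat this as a back-of-the-envelope sketch rather than an exact accounting:

```python
# LoRA adds two low-rank matrices per adapted weight: A (d x r) and B (r x d).
# Only these are trained; the 141B base weights stay frozen.
def lora_params(d_model: int, rank: int, n_projections: int, n_layers: int) -> int:
    """Total trainable parameters for LoRA adapters across the model."""
    per_proj = 2 * d_model * rank  # A and B matrices
    return per_proj * n_projections * n_layers

# e.g. rank-16 adapters on q/k/v/o projections across 56 layers
n = lora_params(d_model=6144, rank=16, n_projections=4, n_layers=56)
print(f"~{n / 1e6:.0f}M trainable params vs 141B frozen")
```

A few tens of millions of trainable parameters is what makes QLoRA fit on 4x A100 80GB: the optimizer state only covers the adapters, while the base model sits quantised in memory.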
Context Length
- Base: 64k tokens.
- Practical: ~32k without severe degradation.
- Decent “needle in haystack” up to ~32k, degrades beyond.
For moderate RAG or long context, sufficient. For full book analysis, Gemini 1.5 still leads.
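For RAG it pays to cap the prompt at the ~32k practical window rather than the nominal 64k. A crude budget guard along these lines works (chars/4 is a rough token estimate; swap in the real tokenizer in production):

```python
# Keep retrieved chunks inside the ~32k-token practical window,
# leaving headroom for the model's answer.
PRACTICAL_CTX = 32_768

def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fit_chunks(chunks: list[str], reserve_for_answer: int = 1024) -> list[str]:
    """Greedily keep chunks (in relevance order) until the budget is spent."""
    budget = PRACTICAL_CTX - reserve_for_answer
    kept, used = [], 0
    for chunk in chunks:
        t = approx_tokens(chunk)
        if used + t > budget:
            break
        kept.append(chunk)
        used += t
    return kept

docs = ["a" * 40_000, "b" * 40_000, "c" * 200_000]
print(len(fit_chunks(docs)))  # the oversized third chunk gets dropped
```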
Real Use Cases
Where Mixtral 8x22B shines:
- Enterprise multilingual: documents in ES/FR/IT/DE/EN.
- Mid-size code agents: not top-tier but capable.
- Long-context RAG.
- Complex summarisation and analysis.
- Self-hosting with strict compliance.
Where others win:
- Maths: Llama 3 70B or Claude 3 Opus.
- Top-tier coding: Claude 3 Opus, DeepSeek Coder.
- Ultra-long context: Gemini 1.5.
Serving Cost
Rough numbers:
- 1 × A100 80GB on-prem: ~$15k/year amortised.
- AWS p4d.24xlarge (8× A100 40GB): $32/hour = ~$23k/month.
- Together.ai hosted: ~$2 per 1M tokens (input + output combined).
At these prices, self-hosting pays off only past roughly 600M sustained tokens per month on hardware cost alone; below that, hosted is more efficient, and ops overhead pushes the threshold higher still.
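The break-even is simple division, counting hardware amortisation only. The prices are the article's round numbers, not current quotes, and power, ops, and engineering time all move the threshold:

```python
# Hosted cost grows linearly with volume; self-hosted is a fixed floor.
HOSTED_USD_PER_M = 2.0               # ~$2 per 1M tokens hosted
SELF_HOSTED_USD_MONTH = 15_000 / 12  # amortised A100 80GB, ~$1,250/month

def monthly_cost(tokens_m: float, self_hosted: bool) -> float:
    """Monthly USD cost for a volume of tokens_m million tokens."""
    return SELF_HOSTED_USD_MONTH if self_hosted else tokens_m * HOSTED_USD_PER_M

breakeven_m = SELF_HOSTED_USD_MONTH / HOSTED_USD_PER_M
print(f"Hardware-only break-even: ~{breakeven_m:.0f}M tokens/month")  # ~625M
```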
Alternatives in Open Space
As of April 2024:
- Llama 3 70B: better at maths reasoning, more restrictive license.
- Qwen 1.5 72B: strong multilingual, commercial license below usage thresholds.
- DeepSeek 67B: excellent at code.
- Command R+ (Cohere): 104B dense, strong for RAG.
- Yi 34B: smaller size, competitive on many benchmarks.
Choice depends on concrete case. There’s no universal “best”.
Conclusion
Mixtral 8x22B confirms that Mistral AI leads the European open frontier. Its MoE architecture strikes an attractive balance between quality and inference efficiency. For teams that can afford the hardware, it's currently the best open option for serious multilingual cases. For those who can't, Mixtral 8x7B remains a valid lighter option. And for serious production without your own GPUs, hosted services like Together.ai, Anyscale, or Mistral's La Plateforme offer pay-per-token access. The open ecosystem keeps closing the gap with closed frontier models.
Follow us on jacar.es for more on open LLMs, MoE architectures, and model deployment.