Mixtral 8x22B: Open and Powerful Mixture of Experts
Updated: 2026-05-03
Mistral AI released Mixtral 8x22B on April 10, 2024, in their characteristic style: a magnet link posted to Twitter, with no prior blog post or conference. The community downloaded the weights within hours, and benchmarks appeared the next day. It is the next generation of their Mixture of Experts (MoE) architecture, with 141B total parameters but only 39B active per forward pass. This changes the economics of serving open models.
Key takeaways
- Mixtral 8x22B’s MoE architecture activates only 39B of its 141B parameters per token: large-model capability at mid-sized-model inference cost.
- Apache 2.0 with no commercial restrictions — the most permissive large-scale option.
- Superior multilingual vs Llama 3 70B, especially in Spanish, French, Italian, and German.
- Minimum hardware is an A100 80GB or H100 80GB to serve the Q4 quantisation; a 24 GB consumer GPU isn’t enough.
- Self-hosting pays off if you sustain more than 100M tokens/month; below that, hosted services are more cost-effective (see the break-even sketch after this list).
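The 100M-token threshold is just a ratio between fixed GPU cost and hosted per-token pricing. Here is a minimal break-even sketch; both input figures are illustrative assumptions, not real quotes from any provider.

```python
# Break-even sketch: self-hosting wins once monthly token volume exceeds
# the fixed GPU cost divided by the hosted per-token price.
# Both inputs below are illustrative assumptions, not real quotes.

def break_even_tokens(self_host_monthly_usd: float, hosted_usd_per_m_tokens: float) -> float:
    """Monthly token volume at which self-hosting and hosted pricing cost the same."""
    return self_host_monthly_usd / hosted_usd_per_m_tokens * 1e6

# Example: an assumed $300/month amortised GPU cost against an assumed
# $3 per 1M hosted tokens puts the break-even at the 100M figure above.
print(f"{break_even_tokens(300.0, 3.0) / 1e6:.0f}M tokens/month")  # -> 100M tokens/month
```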
What Mixtral 8x22B Is
Sparse Mixture of Experts architecture:
- 8 “experts” of 22B parameters each.
- Router selecting 2 experts per token.
- Total: 141B parameters on disk (less than 8 × 22B = 176B, because the non-expert layers are shared across experts).
- Active per forward pass: ~39B.
Result: ~141B of capacity at the inference cost of a ~39B model.
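To make the routing concrete, here is a minimal PyTorch sketch of top-2 expert selection. The class name, dimensions, and expert FFN shape are illustrative placeholders, not Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: route each token to its top-2 of 8 expert FFNs."""

    def __init__(self, dim: int = 512, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.router(x)                         # score every expert per token
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep the 2 best experts
        weights = F.softmax(weights, dim=-1)            # renormalise over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out  # only 2 of the 8 expert FFNs ran for each token

x = torch.randn(10, 512)     # 10 tokens
print(SparseMoE()(x).shape)  # torch.Size([10, 512])
```

Each token pays for two expert forward passes rather than eight, which is where the 141B-capacity / 39B-cost split comes from.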
Key Benchmarks
| Benchmark | Mixtral 8x22B | Llama 3 70B | GPT-4 | GPT-3.5 |
|---|---|---|---|---|
| MMLU | 77.8 | 79.5 | 86.4 | 70.0 |
| HellaSwag | 88.9 | 88.0 | 95.3 | 85.5 |
| GSM8K | 78.6 | 93.0 | 92.0 | 57.1 |
| HumanEval | 45.1 | 81.7 | 88.4 | 48.1 |
| Multilingual (FR, ES, IT, DE) | Excellent | Good | Excellent | Medium |
Multilingual performance is superior to Llama 3 70B, which matters especially for European enterprise use. It trails Llama 3 70B on maths and Claude 3 Opus on coding.
Required Hardware: The Limiting Factor
| Precision | VRAM |
|---|---|
| FP16 | ~280 GB |
| INT8 | ~140 GB |
| INT4 (GGUF Q4_K_M) | ~80 GB |
A 4090 (24 GB) can’t serve it even quantised. One A100 80GB or H100 80GB handles Q4.
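These figures follow from simple arithmetic: weight memory is parameter count times bits per parameter, divided by 8. The quick sketch below reproduces the table; the ~4.5 effective bits for Q4_K_M is an approximation, and KV cache plus activations come on top.

```python
# Back-of-envelope weight memory for Mixtral 8x22B: params * bits / 8.
# KV cache and activation memory are extra and not counted here.
TOTAL_PARAMS = 141e9

def weight_gb(bits_per_param: float) -> float:
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("Q4_K_M (~4.5 bits)", 4.5)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB")
# FP16: ~282 GB | INT8: ~141 GB | Q4_K_M (~4.5 bits): ~79 GB
```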
Production Serving
```bash
# vLLM with tensor parallelism across two GPUs
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
```

vLLM gives the best GPU throughput; llama.cpp is the alternative for portability and mixed CPU-GPU offload.
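The server exposes an OpenAI-compatible API. A minimal client sketch, assuming the command above is running on vLLM's default port 8000 (the api_key value is a placeholder, since vLLM only checks it when started with --api-key):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Résume l'architecture Mixture of Experts en deux phrases."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```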
Conclusion
Mixtral 8x22B confirms that Mistral AI leads the European open frontier. Its MoE architecture strikes an attractive balance between quality and inference efficiency. For teams that can afford the hardware, it is currently the best open option for serious multilingual use cases. For those that can't, Mixtral 8x7B remains a valid lighter option, and for serious production without your own GPUs, hosted services offer pay-per-token access. The open ecosystem keeps closing the gap with closed frontier models.