Artificial Intelligence

Mixtral 8x22B: Open and Powerful Mixture of Experts

Updated: 2026-05-03

Mistral AI released Mixtral 8x22B on April 10, 2024, in their characteristic style: a magnet link posted to Twitter, with no prior blog post or press event. The community downloaded the weights within hours, and benchmarks appeared the next day. It’s the next generation of their MoE (Mixture of Experts) architecture, with 141B total parameters but only 39B active per forward pass. This changes the economics of serving open models.

Key takeaways

  • Mixtral 8x22B’s MoE architecture activates only 39B of its 141B parameters per token: large-model capability with medium-model inferential cost.
  • Apache 2.0 with no commercial restrictions — the most permissive large-scale option.
  • Superior multilingual vs Llama 3 70B, especially in Spanish, French, Italian, and German.
  • Minimum hardware is an A100 80GB or H100 80GB to serve it quantised to Q4; a 24 GB consumer GPU isn’t enough.
  • Self-hosting pays off if you sustain more than 100M tokens/month; below that, hosted services are more efficient.

What Mixtral 8x22B Is

Sparse Mixture of Experts architecture:

  • 8 “experts” of 22B parameters each.
  • Router selecting 2 experts per token.
  • Total: 141B parameters on disk.
  • Active per forward pass: ~39B.

Result: ~141B capacity with ~39B inferential cost.
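The routing described above can be sketched in a few lines. This is a minimal illustration of top-2 expert routing, not Mistral’s actual implementation: the dimensions, the toy linear experts, and the router weights are invented for the example.

```python
import numpy as np

def top2_moe_layer(x, gate_w, experts):
    """Sparse MoE step: route one token's hidden state to its top-2 experts.

    x       : (d,) token hidden state
    gate_w  : (n_experts, d) router weights
    experts : list of callables, one per expert
    """
    logits = gate_w @ x                      # one router score per expert
    top2 = np.argsort(logits)[-2:]           # indices of the 2 highest-scoring experts
    weights = np.exp(logits[top2])
    weights /= weights.sum()                 # softmax over the selected pair only
    # Only 2 of the 8 experts execute: this is the 39B-of-141B saving
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

# Toy demo: 8 experts, each a simple linear map
rng = np.random.default_rng(0)
d, n_experts = 16, 8
expert_mats = [rng.standard_normal((d, d)) for _ in range(n_experts)]
experts = [lambda v, M=M: M @ v for M in expert_mats]
gate_w = rng.standard_normal((n_experts, d))

y = top2_moe_layer(rng.standard_normal(d), gate_w, experts)
print(y.shape)  # (16,)
```

The point of the sketch: all 8 expert matrices sit in memory, but per token only 2 of them do any compute.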

Key Benchmarks

| Benchmark | Mixtral 8x22B | Llama 3 70B | GPT-4 | GPT-3.5 |
|---|---|---|---|---|
| MMLU | 77.8 | 79.5 | 86.4 | 70.0 |
| HellaSwag | 88.9 | 88.0 | 95.3 | 85.5 |
| GSM8K | 78.6 | 93.0 | 92.0 | 57.1 |
| HumanEval | 45.1 | 81.7 | 88.4 | 48.1 |
| Multilingual (FR, ES, IT, DE) | Excellent | Good | Excellent | Medium |

Multilingual performance is superior to Llama 3 70B’s, which matters especially for European enterprise use. It trails Llama 3 70B on maths (GSM8K) and models like Claude 3 Opus on coding.

Required Hardware: The Limiting Factor

| Precision | VRAM |
|---|---|
| FP16 | ~280 GB |
| INT8 | ~140 GB |
| INT4 (GGUF Q4_K_M) | ~80 GB |

An RTX 4090 (24 GB) can’t serve it even quantised. A single A100 80GB or H100 80GB handles Q4.
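The VRAM figures above follow from a back-of-the-envelope calculation: total parameters times bytes per parameter. A quick sketch (the ~4.5 bits/weight average for Q4_K_M is an approximation, and real serving needs extra headroom for the KV cache and activations):

```python
PARAMS = 141e9  # Mixtral 8x22B total parameters on disk

# Approximate storage cost per parameter; Q4_K_M averages ~4.5 bits/weight
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4 (Q4_K_M)": 0.5625}

for name, b in BYTES_PER_PARAM.items():
    gb = PARAMS * b / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
```

This yields roughly 282 / 141 / 79 GB of raw weights, consistent with the table once serving overhead is added.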

Production Serving

```bash
# vLLM with tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
```

vLLM for best GPU throughput. llama.cpp for portability and mixed CPU-GPU offload.
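Once a vLLM server is up, it speaks the OpenAI chat completions protocol. A minimal client sketch, assuming the server from the command above is listening on localhost:8000 (the endpoint URL and prompt are illustrative):

```python
import json

# Assumed address of the vLLM OpenAI-compatible server started above
API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build an OpenAI-compatible chat completion payload for the served model."""
    return {
        "model": "mistralai/Mixtral-8x22B-Instruct-v0.1",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request("Résume ce texte en français : ...")
print(json.dumps(payload, indent=2))

# To actually send it (requires the `requests` package and a running server):
#   import requests
#   reply = requests.post(API_URL, json=payload).json()
#   print(reply["choices"][0]["message"]["content"])
```

Because the API shape matches OpenAI’s, existing client code usually only needs the base URL swapped.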

Conclusion

Mixtral 8x22B confirms that Mistral AI leads the European open frontier. Its MoE architecture strikes an attractive balance between quality and inference cost. For teams that can afford the hardware, it’s currently the best open option for serious multilingual use cases. For those that can’t, Mixtral 8x7B remains a valid lighter option. And for serious production without your own GPUs, hosted services offer pay-per-token access. The open ecosystem continues closing the gap with closed frontier models.


Written by

CEO - Jacar Systems

Passionate about technology, cloud infrastructure and artificial intelligence. Writes about DevOps, AI, platforms and software from Madrid.