Mixtral 8x22B: Open and Powerful Mixture of Experts

Cluster of interconnected, illuminated processors representing a mixture-of-experts architecture

Mistral AI released Mixtral 8x22B on April 10, 2024, in their characteristic style: a magnet link posted on Twitter, with no prior blog post or announcement. The community downloaded the weights within hours, and benchmarks appeared the next day. It is the next generation of their MoE (Mixture of Experts) architecture, with 141B total parameters but only ~39B active per forward pass. This changes the economics of serving open models.

What Mixtral 8x22B Is

Sparse Mixture of Experts architecture:

  • 8 “experts” of 22B parameters each.
  • Router selecting 2 experts per token.
  • Total: 141B parameters on disk.
  • Active per forward pass: ~39B (2 experts + shared components).

Result: ~141B capacity with ~39B inferential cost. Better quality/cost ratio than a dense model of equivalent size.
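The arithmetic behind that ratio can be sketched in a few lines. The split used here is an illustrative assumption: only the expert FFN weights (~17B per expert) are routed, while attention and embedding weights (~5B) are shared, which is why 2 × 22B does not equal 39B.

```python
# Rough MoE cost sketch: total capacity vs. active compute per token.
# The 17B/5B split is an illustrative assumption, not Mixtral's exact layout.

def moe_params(n_experts: int, expert_b: float, shared_b: float) -> dict:
    """Total vs. active parameters (in billions) for a top-2 MoE."""
    total = n_experts * expert_b + shared_b
    active = 2 * expert_b + shared_b  # the router picks 2 experts per token
    return {"total_b": total, "active_b": active, "ratio": total / active}

stats = moe_params(n_experts=8, expert_b=17.0, shared_b=5.0)
print(stats)  # total 141B, active 39B, ~3.6x capacity per unit of compute
```

The takeaway: you pay storage for 141B parameters but FLOPs for only 39B, which is the quality/cost advantage over a dense model of the same total size.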

License and Distribution

Apache 2.0, with no commercial-use restrictions. The weights are freely downloadable.

Compared to Llama 3 70B (more restrictive licence) or Claude 3 (closed), Mixtral 8x22B is the most permissive large-scale option.

Key Benchmarks

Numbers from Mistral and community:

| Benchmark | Mixtral 8x22B | Llama 3 70B | GPT-4 | GPT-3.5 |
|---|---|---|---|---|
| MMLU | 77.8 | 79.5 | 86.4 | 70.0 |
| HellaSwag | 88.9 | 88.0 | 95.3 | 85.5 |
| GSM8K | 78.6 | 93.0 | 92.0 | 57.1 |
| HumanEval | 45.1 | 81.7 | 88.4 | 48.1 |
| Multilingual (FR, ES, IT, DE) | Excellent | Good | Excellent | Medium |

Key points:

  • General quality near Llama 3 70B, with more inferentially efficient architecture.
  • Superior multilingual vs Llama 3 70B — especially Spanish, French, Italian, German.
  • Behind on maths vs Llama 3 70B.
  • Competitive coding but not top.

For EU multilingual cases, Mixtral 8x22B is likely the best open option.

Required Hardware

This is the limiting factor:

| Precision | VRAM |
|---|---|
| FP16 | ~280 GB |
| INT8 | ~140 GB |
| INT4 (GGUF Q4_K_M) | ~80 GB |
| INT3 | ~60 GB |

Practical implications:

  • Doesn’t fit on a single consumer GPU: a 4090 (24GB) can’t hold it even heavily quantised.
  • One A100 80GB or H100 80GB can serve quantised Q4.
  • 2x A100 40GB distributed with tensor parallelism works.
  • Apple Silicon M3 Max 128GB: fits Q4 and works at ~5-10 tokens/s.

For serious production, you almost always need datacenter GPUs.
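The VRAM figures in the table above follow directly from parameter count times bits per weight. A rough estimator (the ~4.5 bits/weight for Q4_K_M and ~3.3 for INT3 are approximations; actual GGUF file sizes vary by tensor, and KV cache plus activations come on top):

```python
def vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB for a model of `params_b` billion parameters.

    KV cache and activation memory are NOT included and come on top.
    """
    return params_b * bits_per_weight / 8  # billions of params -> GB

# Approximate effective bits per weight per quantisation level (assumed values).
for name, bits in [("FP16", 16.0), ("INT8", 8.0), ("Q4_K_M", 4.5), ("INT3", 3.3)]:
    print(f"{name}: ~{vram_gb(141, bits):.0f} GB")
```

Running this reproduces the table within rounding: ~282, ~141, ~79, and ~58 GB, which is why a single 80GB card only works at Q4 and below.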

Comparison with Mixtral 8x7B

The younger sibling (46.7B total, 12.9B active):

| Aspect | 8x7B | 8x22B |
|---|---|---|
| Total parameters | 46.7B | 141B |
| Active/token | 12.9B | 39B |
| Q4 VRAM | ~25GB | ~80GB |
| General quality | ~GPT-3.5 | Slightly below GPT-4 |
| Multilingual | Very good | Excellent |
| Tokens/s (A100 Q4) | ~60 | ~25 |

For many cases, 8x7B is more pragmatic: faster, cheaper, sufficient quality. 8x22B makes sense when quality matters more than throughput.

Production Serving

Typical stack:

# vLLM with tensor parallel
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768

For Q4 with llama.cpp:

./server -m mixtral-8x22b-instruct-Q4_K_M.gguf \
  -c 16384 -ngl 99 --host 0.0.0.0 --port 8080

vLLM gives the best GPU throughput; llama.cpp is more portable and handles mixed CPU-GPU offload.
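Both servers expose an OpenAI-compatible chat endpoint, so a client only needs to build a standard payload. A minimal standard-library sketch (the base URL assumes vLLM's default port 8000; the model name matches the command above):

```python
import json
import urllib.request

def chat_request(prompt: str,
                 base_url: str = "http://localhost:8000/v1",
                 model: str = "mistralai/Mixtral-8x22B-Instruct-v0.1",
                 ) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Summarise this contract clause in French.")
# resp = urllib.request.urlopen(req)  # requires the server to be running
```

Because the API surface is OpenAI-compatible, existing client libraries also work unchanged by pointing their base URL at the local server.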

Fine-Tuning

LoRA on Mixtral 8x22B is feasible:

  • QLoRA: possible on 4x A100 80GB.
  • Expert-specific adaptation (MoE-aware fine-tuning) is active research.
  • DPO for alignment after domain fine-tune.

For most enterprise cases, prompt engineering plus RAG with the Mixtral instruct model covers the need without any fine-tuning. Fine-tune only when prompting clearly falls short.
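To see why LoRA is feasible on a 141B model at all, compare trainable adapter parameters against the full model. A back-of-the-envelope sketch (layer count, hidden size, and the set of targeted projections are illustrative assumptions, not Mixtral's exact config; square projections also overestimate k/v under grouped-query attention):

```python
def lora_params(n_layers: int, d_model: int, rank: int, n_proj: int) -> int:
    """Trainable LoRA parameters: two rank-r matrices per targeted projection.

    Assumes square d_model x d_model projections, so A is (d_model x r)
    and B is (r x d_model).
    """
    per_proj = 2 * d_model * rank
    return n_layers * n_proj * per_proj

# Assumed config: 56 layers, hidden size 6144, rank 16, q/k/v/o targeted.
trainable = lora_params(n_layers=56, d_model=6144, rank=16, n_proj=4)
print(f"{trainable / 1e6:.0f}M trainable vs 141,000M total")  # ~44M, ~0.03%
```

Only the adapters need gradients and optimizer state, which is why QLoRA fits on 4x A100 80GB while full fine-tuning would not.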

Context Length

  • Base: 64k tokens.
  • Practical: ~32k without severe degradation.
  • Decent “needle in haystack” up to ~32k, degrades beyond.

For moderate RAG or long context, sufficient. For full book analysis, Gemini 1.5 still leads.
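For RAG pipelines, the practical consequence is capping the assembled prompt well under the nominal window. A minimal context-budget helper (the 4-characters-per-token heuristic is a crude assumption; use the model's real tokenizer in production):

```python
def fit_context(chunks: list[str], budget_tokens: int,
                chars_per_token: int = 4) -> list[str]:
    """Greedily keep retrieved chunks until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        est = len(chunk) // chars_per_token + 1  # crude token estimate
        if used + est > budget_tokens:
            break
        kept.append(chunk)
        used += est
    return kept

docs = ["a" * 4000, "b" * 4000, "c" * 4000]  # ~1000 tokens each
print(len(fit_context(docs, budget_tokens=2500)))  # keeps 2 of 3 chunks
```

Budgeting for ~32k rather than the nominal 64k leaves headroom for the system prompt and the generated answer while staying in the range where recall holds up.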

Real Use Cases

Where Mixtral 8x22B shines:

  • Enterprise multilingual: documents in ES/FR/IT/DE/EN.
  • Mid-size code agents: not top-tier but capable.
  • Long-context RAG.
  • Complex summarisation and analysis.
  • Self-hosting with strict compliance.

Where others win:

  • Maths: Llama 3 70B or Claude 3 Opus.
  • Top-tier coding: Claude 3 Opus, DeepSeek Coder.
  • Ultra-long context: Gemini 1.5.

Serving Cost

Rough estimates:

  • 1 × A100 80GB on-prem: ~$15k/year amortised.
  • AWS p4d.24xlarge (8× A100 40GB): $32/hour = ~$23k/month.
  • Together.ai hosted: ~$2/1M input + output tokens.

Self-hosting pays off only at sustained volumes of hundreds of millions of tokens per month; below that, hosted pay-per-token is more cost-efficient.

Alternatives in Open Space

As of April 2024:

  • Llama 3 70B: better in math reasoning, more restrictive licence.
  • Qwen 1.5 72B: strong multilingual, commercial licence under thresholds.
  • DeepSeek 67B: excellent at code.
  • Command R+ (Cohere): 104B dense, strong for RAG.
  • Yi 34B: smaller size, competitive on many benchmarks.

Choice depends on concrete case. There’s no universal “best”.

Conclusion

Mixtral 8x22B confirms that Mistral AI leads the open European frontier. Its MoE architecture strikes an attractive balance between quality and inference efficiency. For teams that can afford the hardware, it is currently the best open option for serious multilingual use cases. For those who can't, Mixtral 8x7B remains a valid lighter option. And for serious production without your own GPUs, hosted services like Together.ai, Anyscale, or Mistral's La Plateforme offer pay-per-token access. The open ecosystem continues to close the gap with closed frontier models.

Follow us on jacar.es for more on open LLMs, MoE architectures, and model deployment.
