Llama 3.1 405B: When Open Catches Up With Closed Top-Tier

Mountains in fog, representing the imposing scale of an open frontier model

Llama 3.1 405B, released by Meta on July 23, 2024, is the first open-weight model to seriously compete with GPT-4o and Claude 3.5 Sonnet. It has 405 billion parameters and a 128k-token context window (up from 8k in Llama 3), and was trained on 15T tokens, with significant improvements in reasoning and code. For teams that want an open alternative to the commercial frontier, the moment has arrived.

What’s Different

Compared with Llama 3 70B:

  • 405B parameters (5.8x more).
  • 128k context (vs 8k).
  • Quality on par with GPT-4o on many benchmarks.
  • Expanded multilingual support.
  • The license keeps the >700M-MAU restriction.

Llama 3.1 also refreshes the 8B and 70B models with the same 128k context and quality improvements.

Benchmarks

| Benchmark | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
| --- | --- | --- | --- |
| MMLU | 88.6 | 88.7 | 88.7 |
| HumanEval | 89.0 | 90.2 | 92.0 |
| GSM8K | 96.8 | 95.8 | 95.0 |
| MATH | 73.8 | 76.6 | 71.1 |

Performance is equivalent to the closed frontier; for many tasks, indistinguishable.

Hardware

For inference:

  • FP16: ~810GB VRAM.
  • INT8: ~405GB.
  • INT4 (GGUF): ~220GB.
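The figures above follow from a simple rule of thumb: parameter count × bits per parameter, divided by 8. A minimal sketch (the ~4.5 bits-per-weight figure for Q4 GGUF is an approximation, and real usage adds KV cache and activation memory on top):

```python
def vram_estimate_gb(params_billions: float, bits_per_param: float) -> float:
    """Back-of-envelope VRAM for model weights alone, in decimal GB.

    bits_per_param: 16 for FP16, 8 for INT8, ~4.5 for Q4 GGUF (an approximation;
    GGUF quants mix bit widths). KV cache and activations are NOT included.
    """
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4 (Q4 GGUF)", 4.5)]:
    print(f"{label}: ~{vram_estimate_gb(405, bits):.0f} GB")
```

This reproduces the ~810GB / ~405GB / ~220GB figures for 405B parameters; budget extra headroom for the KV cache, especially at long contexts.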

Implications:

  • Viable only for serious on-prem setups with multiple A100/H100 GPUs.
  • 2-4 × H100 80GB for Q4 with tensor parallelism.
  • Apple Silicon M2 Ultra 192GB fits Q4 (8-10 tokens/s).
  • Not viable on consumer hardware.

Access Options

If you can’t self-host:

  • Together.ai: pay-per-token, ~$3-5 per 1M tokens.
  • Fireworks: similar.
  • Groq: extremely fast (>300 tokens/s on 405B via its custom LPU hardware).
  • AWS Bedrock: enterprise-grade.
  • Vertex AI (Google): available.
  • Meta AI: consumer-facing.
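Most of the hosts above expose an OpenAI-compatible chat-completions endpoint, so switching from a closed model is largely a matter of changing the base URL and model name. A hedged sketch using only the standard library; the URL, model identifier, and `TOGETHER_API_KEY` variable are assumptions to check against your provider's docs:

```python
import json
import os
import urllib.request

# Assumed values for illustration; verify against your provider's documentation.
API_URL = "https://api.together.xyz/v1/chat/completions"
MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload for the 405B model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """Send the payload and return the assistant's reply (requires network + API key)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the payload format is shared, the same `build_request` works against Fireworks, Groq, or a local OpenAI-compatible server by swapping `API_URL` and `MODEL`.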

Use Cases

Where 405B justifies its cost:

  • Frontier tasks: complex reasoning, research.
  • Sophisticated multi-step agents.
  • Distillation: use 405B to generate training data for smaller models.
  • Compliance: organizations that require a self-hosted frontier model.

Where 70B suffices:

  • Typical enterprise RAG.
  • Chat assistant.
  • Standard creative generation.

The cost difference between 405B and 70B is roughly 10x; make sure the use case justifies it.

Distillation: The Side Effect

405B release opened the door to “distillation” — using 405B to generate training data that improves 8B and 70B. This is a central reason Meta released it.

Community fine-tunes of 8B models trained on 405B outputs already show surprising quality.
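The distillation workflow is simple at its core: collect teacher (405B) responses to a set of prompts into a supervised fine-tuning dataset for the smaller model. A minimal sketch; the `teacher` callable stands in for a hosted 405B endpoint, and the chat-style JSONL schema is a common SFT convention, not Meta's official format:

```python
import json

def distill_dataset(prompts, teacher, out_path="distilled.jsonl"):
    """Write (prompt, teacher response) pairs as chat-style JSONL for SFT.

    `teacher` is any callable prompt -> response; in practice it would wrap
    a hosted 405B API. Each line is one training example.
    """
    with open(out_path, "w") as f:
        for prompt in prompts:
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": teacher(prompt)},
                ]
            }
            f.write(json.dumps(record) + "\n")
    return out_path

# Usage with a stand-in teacher (replace with a real 405B call):
path = distill_dataset(["Explain KV caching."],
                       teacher=lambda p: f"[405B answer to: {p}]")
```

The Llama 3.1 license explicitly permits using outputs to train other models, which is what makes this workflow central to the release.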

Limitations

  • Cost: prohibitive for modest self-hosting.
  • Latency: processing a full 128k-token context can take over a minute.
  • License: Community license with >700M MAU restrictions.
  • Multimodality: text only (Llama 3.2 adds vision later).

Conclusion

Llama 3.1 405B closed the open-vs-closed gap in mid-2024. For companies with serving capacity, or those using hosted providers, it is a real alternative to GPT-4o and Claude 3.5 Sonnet. For most teams, Llama 3.1 70B remains the more pragmatic choice. Its historical importance exceeds its practical adoption: it showed that open-weight models can reach the frontier, and it marks the point where "only closed models are frontier" stopped being true.

Follow us on jacar.es for more on open LLMs and frontier models.

Related posts