Llama 3: Meta’s New Open Standard
Updated: 2026-05-03
Meta released Llama 3 on April 18, 2024 in two sizes, 8B and 70B, both with Instruct variants for chat. It was trained on 15 trillion tokens — 7.5x more than Llama 2 — with a 128k-vocabulary tokenizer and Grouped Query Attention on both sizes. On many tasks it closes much of the gap that separated open models from closed frontier models.
Key takeaways
- 15T training tokens vs 2T for Llama 2: the data scale shows up most visibly in reasoning and instruction following.
- GQA on 8B and 70B: more efficient inference without sacrificing quality.
- Llama 3 70B competes with Claude 3 Sonnet on MMLU, HumanEval, and GSM8K.
- Llama 3 8B beats Llama 2 13B on almost all benchmarks with roughly 40% fewer parameters.
- The Llama 3 Community License allows commercial use up to 700M MAU at no additional cost.
Key differences from Llama 2
- 15T training tokens vs 2T: 7.5x more data.
- 8k context at launch, double Llama 2’s 4k (later extended to 128k in Llama 3.1).
- Improved tokenizer with 128k vocabulary vs 32k: more efficient tokenisation, especially for non-English languages.
- GQA on both sizes, where Llama 2 used it only on the 70B: better quality/inference-cost ratio.
- Significantly better instruction tuning — SFT, rejection sampling, PPO, and DPO — with less verbosity and better instruction adherence.
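The GQA idea from the list above can be sketched in a few lines: several query heads share one key/value head, so the KV cache shrinks by the ratio of query heads to KV heads. This is an illustrative NumPy toy, not Llama 3’s actual implementation; the head counts and dimensions here are made up for the example.

```python
import numpy as np

def gqa(q, k, v):
    """Grouped Query Attention sketch.
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]       # query heads per KV head
    # Repeat each KV head so every query head has a matching K/V.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                          # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))   # 8 query heads
k = rng.normal(size=(2, 4, 16))   # only 2 KV heads: 4x smaller KV cache
v = rng.normal(size=(2, 4, 16))
out = gqa(q, k, v)
print(out.shape)
```

The output keeps one vector per query head, but the KV cache only ever stores the 2 shared heads — the source of the inference savings the bullet points describe.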
Benchmarks
| Benchmark | Llama 3 8B | Llama 3 70B | Claude 3 Sonnet | GPT-4 Turbo |
|---|---|---|---|---|
| MMLU | 68.4 | 79.5 | 79.0 | 86.4 |
| HumanEval | 62.2 | 81.7 | 73.0 | 85.4 |
| GSM8K | 79.6 | 93.0 | 92.3 | 92.0 |
Llama 3 70B is in Claude 3 Sonnet’s league on most tasks, and the 8B outperforms the larger Llama 2 13B across nearly all benchmarks.
Hardware requirements
8B Q4 fits in 16 GB Apple Silicon. 70B Q4 requires an A100 80 GB or two A100 40 GB. For serious production throughput, vLLM with tensor parallelism is the standard for the 70B.
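These sizing claims follow from back-of-envelope arithmetic on quantized weights. The sketch below is illustrative only: the 4.5 bits/weight figure is an assumed average for a typical Q4 quantization with overhead, and it ignores KV cache, activations, and runtime overhead, which is why real deployments need headroom beyond the weight footprint.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Weight memory in GiB: params x bits, over 8 bits/byte, over 1024^3."""
    return n_params * bits_per_weight / 8 / 1024**3

# Assumed sizes; 4.5 bits/weight approximates a Q4 scheme with overhead.
for name, n in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    print(f"{name}: Q4 ~ {weight_gib(n, 4.5):.1f} GiB, "
          f"FP16 ~ {weight_gib(n, 16):.1f} GiB")
```

The 70B at ~37 GiB of Q4 weights leaves room for cache and activations on one A100 80 GB, consistent with the requirement above; in vLLM, splitting it across two 40 GB cards is done with the `--tensor-parallel-size 2` engine argument.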
Licence
The Llama 3 Community License allows commercial use up to 700M MAU with a “Built with Meta Llama 3” display requirement. For the vast majority of organisations, the licence is permissive enough for production deployments.
Where it excels and where it doesn’t
Strong: code generation (HumanEval 62%/82%), maths reasoning (GSM8K 79–93%), instruction following. Relatively weak: multilingual (Mistral and Qwen are still better), long context (addressed in Llama 3.1), multimodal (addressed in Llama 3.2).
Conclusion
Llama 3 is a real leap over Llama 2 and sets the open reference standard. The 8B is the default option for modest self-hosting; the 70B competes with closed frontier models on most tasks. Combined with a massive ecosystem of fine-tunes, quantised variants, and tooling, it’s the safe choice for teams serious about open LLMs. For extreme multilingual needs or very long context, Mixtral or Gemini remain preferable; for everything else, Llama 3 is the sensible default.