LLaMA (Meta AI)
LLaMA (Large Language Model Meta AI) is a family of large language models (LLMs) developed by Meta AI and released mostly with openly available weights. The models use a modified Transformer architecture and emphasize computational efficiency, broad access to advanced AI, and easy adaptation to specialized tasks. The family has evolved from the initial research release of LLaMA 1 (February 2023) to the multimodal models of LLaMA 4 (planned for release in 2025).
Naming
The acronym LLaMA stands for Large Language Model Meta AI.
- Large Language Model emphasizes the scale of the models, with parameters ranging from billions to trillions.
- Meta AI indicates the developer, Meta's research group.
History
The development of LLaMA began in late 2022 as a strategic response by Meta to the success of OpenAI's ChatGPT. Mark Zuckerberg formed a cross-disciplinary team that included researchers from the FAIR (Facebook AI Research) lab. Yann LeCun, Meta's chief AI scientist and the founder of FAIR, played a key role in shaping the project's philosophy; he had championed open publication of the lab's research since its founding in 2013.
The first version, LLaMA 1, was released in February 2023 under a research license. Shortly after its release, in March 2023, the model's weights were leaked online via BitTorrent. Contrary to fears, the leak accelerated rather than halted the project's development, because it let independent researchers and enthusiasts worldwide experiment with the model. As a result, tens of thousands of derivative models appeared on the Hugging Face platform. Subsequent versions, starting with LLaMA 2, were released with a commercial license[1], cementing LLaMA's status as a key player in the open AI model market.
Model Evolution and Release Chronology
| Version | Release Date | Parameter Range | Key Innovations and Features |
|---|---|---|---|
| LLaMA 1 | February 2023 | 7B–65B | Base architecture (RMSNorm, SwiGLU, RoPE). Trained on up to 1.4 trillion tokens. 2048-token context window. Research license. |
| LLaMA 2 | July 2023 | 7B–70B | Chat variants fine-tuned for dialogue with RLHF. Grouped-Query Attention (GQA) introduced in the larger models. 4096-token context window. First commercial license. |
| Code Llama | August 2023 | 7B–34B (70B added January 2024) | Specialized version for code. Fine-tuned on 500 billion tokens of code. Variants: base, Python-specialized, instruction-tuned. |
| LLaMA 3 | April 2024 | 8B, 70B | Trained on 15 trillion tokens. Improved tokenizer with a 128k token vocabulary. High performance (82% on MMLU). |
| LLaMA 3.1 | July 2024[2] | 8B, 70B, 405B | Flagship 405B model with performance on par with GPT-4o. Context window up to 128k tokens. Released models are text-only; image input followed in the LLaMA 3.2 vision models (September 2024). |
| LLaMA 4 | (planned: April 2025) | 109B (Scout), 400B (Maverick), 2T (Behemoth) | Mixture-of-Experts (MoE) architecture. Native multimodality (text, images, video). Context window up to 10 million tokens. |
Architecture
LLaMA uses an autoregressive decoder-only transformer architecture but introduces several key improvements that enhance computational efficiency and the quality of generated text:
- Pre-normalization. Normalization is applied at the input of each transformer sub-layer, rather than at the output. This approach stabilizes the training of very deep networks and reduces the risk of vanishing or exploding gradients.
- RMSNorm (Root Mean Square Layer Normalization). Instead of the standard LayerNorm, RMSNorm is used. It drops the mean-subtraction step and rescales activations only by their root mean square. The original RMSNorm paper reports running-time reductions of roughly 7–64%, depending on the model, with comparable quality.
- SwiGLU (Swish-Gated Linear Unit). SwiGLU replaces ReLU or GELU as the activation in the feed-forward layers. Its gating mechanism multiplies a Swish-activated projection by a second linear projection, which creates a smoother gradient flow and improves model quality.
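- RoPE (Rotary Position Embeddings). RoPE encodes each token's position by rotating its query and key vectors through position-dependent angles, so that attention scores depend on the relative distance between tokens. This is intended to help the model extrapolate to sequences longer than those used during training.
- GQA (Grouped-Query Attention). Introduced in the larger LLaMA 2 models, GQA is an optimization of multi-head attention in which groups of query heads share a single key/value head. This shrinks the key-value cache, which significantly reduces memory requirements and accelerates text generation.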
- Mixture-of-Experts (MoE) (planned for LLaMA 4). An architecture that divides the model's parameters into "expert" sub-networks and routes each token to only a small number of them. Because most parameters stay inactive for any given token, the computational cost of inference drops sharply relative to a dense model of the same total size.
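The components listed above can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not Meta's implementation: the function names, tensor layout `(seq, heads, head_dim)`, the RoPE base of 10000, and the pairing of adjacent features for rotation are all choices made here for readability.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x W_gate) * (x W_up)) W_down."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))          # Swish with beta = 1
    return (silu * (x @ w_up)) @ w_down

def rope(x, base=10000.0):
    """Rotate adjacent feature pairs of x (seq, heads, head_dim) by position."""
    seq_len, _, head_dim = x.shape
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def grouped_query_attention(q, k, v):
    """Causal attention where k and v have fewer heads than q."""
    seq_len, n_heads, head_dim = q.shape
    group = n_heads // k.shape[1]
    k = np.repeat(k, group, axis=1)              # each K/V head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # mask out later positions
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

# Usage: 8 query heads sharing 2 key/value heads (illustrative sizes).
rng = np.random.default_rng(0)
q = rope(rng.normal(size=(5, 8, 16)))
k = rope(rng.normal(size=(5, 2, 16)))
v = rng.normal(size=(5, 2, 16))
out = grouped_query_attention(q, k, v)           # shape (5, 8, 16)
```

In this sketch only the two key/value heads would need to be cached during generation, rather than eight, which is where the memory saving of GQA comes from.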
LLaMA 1 Configurations
| Model | Parameters | Hidden State Dimension | Number of Layers | Number of Attention Heads | Training Data Volume |
|---|---|---|---|---|---|
| 7B | 6.7B | 4096 | 32 | 32 | 1.0T tokens |
| 13B | 13.0B | 5120 | 40 | 40 | 1.0T tokens |
| 33B | 32.5B | 6656 | 60 | 52 | 1.4T tokens |
| 65B | 65.2B | 8192 | 80 | 64 | 1.4T tokens |
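The parameter counts in the table can be approximately reproduced from the hidden dimension and layer count. The sketch below is an estimate under assumptions that do not appear in the table: a 32,000-token vocabulary, untied input and output embeddings, no bias terms, and a SwiGLU feed-forward width of two thirds of `4 * d_model` rounded up to a multiple of 256. The helper names are hypothetical.

```python
def ffn_hidden_dim(d_model, multiple_of=256):
    """Assumed SwiGLU width: 2/3 of 4*d_model, rounded up to a multiple of 256."""
    hidden = (8 * d_model) // 3
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def count_params(d_model, n_layers, vocab_size=32_000):
    """Estimate total parameters of a LLaMA-1-style decoder (assumptions above)."""
    attention = 4 * d_model * d_model                      # W_q, W_k, W_v, W_o
    feed_forward = 3 * d_model * ffn_hidden_dim(d_model)   # gate, up, down projections
    norms = 2 * d_model                                    # two RMSNorm weights per layer
    per_layer = attention + feed_forward + norms
    embeddings = 2 * vocab_size * d_model                  # input table + output head
    return n_layers * per_layer + embeddings + d_model     # + final RMSNorm

# (hidden dimension, number of layers) taken from the table above
CONFIGS = {"7B": (4096, 32), "13B": (5120, 40), "33B": (6656, 60), "65B": (8192, 80)}

for name, (d_model, n_layers) in CONFIGS.items():
    print(f"{name}: {count_params(d_model, n_layers) / 1e9:.2f}B parameters")
```

Under these assumptions each estimate lands within roughly 0.1B of the figure in the table. The number of attention heads does not enter the count, because the heads only partition the hidden dimension.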
Training Data
The volume of the training corpora grew from 1.4 trillion tokens for LLaMA 1 to 15 trillion for LLaMA 3. LLaMA 1 was trained on publicly available sources: Common Crawl (which constitutes up to 67% of the data), C4, GitHub, Wikipedia, Books, ArXiv, and Stack Exchange. For LLaMA 3, high-quality private data was also used.
Performance and Comparison
- On benchmarks: The LLaMA 3.1 (405B) model shows results close to GPT-4o: on the MMLU test, it achieves 88.6%, trailing GPT-4o by only 0.1 percentage points. On the HumanEval code generation task, LLaMA 3.1 scores 89% (GPT-4o: 90.2%).
- Parameter efficiency: LLaMA models with fewer parameters often outperform larger competitor models. For example, LLaMA 1 (13B) surpassed GPT-3 (175B) on most benchmarks.
- Cost: When hosted locally, the inference cost of LLaMA can be up to 50 times lower compared to using proprietary APIs, making the technology accessible to small and medium-sized businesses.
Licensing
- LLaMA 1 was distributed under a non-commercial research license with access available upon request.
- LLaMA 2 and later versions are distributed under the Llama Community License, which permits commercial use and modification. However, the license contains restrictions: companies with more than 700 million monthly active users must obtain special permission from Meta. This has sparked debate about whether LLaMA is a fully open model.
Applications
LLaMA models are integrated into the products of thousands of companies and are used in various fields:
- Corporate sector: Zoom uses LLaMA in its AI Companion for meeting summaries; Shopify uses it to process 40–60 million daily requests to enrich product metadata; Instacart uses it in its internal assistant, Ava.
- Science and society: Meditron (an adaptation of LLaMA) is used for medical diagnosis in resource-limited regions.
- Government and industry: Meta has formed partnerships with Lockheed Martin and Palantir. NASA uses LLaMA 3 on the ISS as an offline assistant to perform critical operations without communication with Earth.
Limitations and Criticism
- Bias and safety: Independent audits show that despite safety measures, LLaMA models can reproduce harmful stereotypes. The leak of LLaMA 1's weights heightened concerns about the potential malicious use of the technology.
- Knowledge gaps: In highly specialized domains, LLaMA can exhibit knowledge gaps. For example, its accuracy on the nephSAP medical test was 17–30%, compared to 73% for GPT-4.
- Energy consumption: Training large models requires enormous resources. Developing the LLaMA 1 models is estimated to have consumed about 2,638 MWh of energy, corresponding to roughly 1,015 tonnes of CO₂-equivalent emissions.
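The energy and emissions figures for LLaMA 1 imply a carbon-intensity factor, which the following sketch derives and then reuses. It is a back-of-the-envelope illustration: the function name is hypothetical, and the default factor of 0.385 is the value implied by the two figures above (it appears to match the US-average grid factor used in the LLaMA paper, but treat that as an assumption).

```python
ENERGY_MWH = 2638        # energy figure quoted above
EMISSIONS_T = 1015       # tonnes of CO2-equivalent quoted above

# Tonnes per MWh is numerically equal to kilograms per kWh.
implied_intensity = EMISSIONS_T / ENERGY_MWH
print(f"implied intensity: {implied_intensity:.3f} kg CO2eq per kWh")

def estimate_emissions_t(energy_mwh, kg_per_kwh=0.385):
    """Estimate emissions in tonnes CO2eq from energy use and grid intensity."""
    return energy_mwh * kg_per_kwh

print(f"estimate for {ENERGY_MWH} MWh: {estimate_emissions_t(ENERGY_MWH):.0f} t CO2eq")
```

The same function can be used to see how the estimate would change on a cleaner or dirtier grid by passing a different `kg_per_kwh` value.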
Future
Meta plans to invest up to $65 billion in AI infrastructure in 2025. The LLaMA 4 Behemoth model, with 2 trillion parameters, is under development. It is expected to support over 200 languages and feature deep integration with metaverse products.
Literature
- Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. arXiv:2305.13245.
- Fedus, W.; Zoph, B.; Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
- Grattafiori, A. et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
- Jiang, Z. et al. (2023). Pre‑RMSNorm and Pre‑CRMSNorm Transformers: Equivalent and Efficient Pre‑LN Transformers. arXiv:2305.14858.
- Rozière, B. et al. (2023). Code Llama: Open Foundation Models for Code. arXiv:2308.12950.
- Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202.
- Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
- Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
- Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine‑Tuned Chat Models. arXiv:2307.09288.
- Zhang, B.; Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv:1910.07467.
Notes
- ↑ The LLaMA license does not meet all criteria for open-source software, as it imposes restrictions on commercial use by the largest companies and subjects all use to an acceptable use policy.
- ↑ LLaMA 3.1 was announced and released in July 2024. See the official Meta announcement.
See also
- GPT
- Large language models
- Transformer (neural network architecture)