LLaMA (Meta AI)

LLaMA (Large Language Model Meta AI) is a family of large language models (LLMs) developed by Meta AI and released mostly with openly available weights. The models are built on a modified Transformer architecture and emphasize computational efficiency, broad access to advanced AI technology, and easy adaptation to specialized tasks. The family has evolved from the initial research release of LLaMA 1 (February 2023) to the multimodal models of LLaMA 4 (planned for release in 2025).

Naming

The acronym LLaMA stands for Large Language Model Meta AI.

"Large Language Model" emphasizes the scale of the models, whose parameter counts range from billions to trillions.
"Meta AI" identifies the developer, Meta's AI research group.

History

The development of LLaMA began in late 2022 as a strategic response by Meta to the success of OpenAI's ChatGPT. Mark Zuckerberg formed a cross-disciplinary team that included researchers from the FAIR (Facebook AI Research) lab. A key role in the project's philosophy was played by Yann LeCun, Meta's chief AI scientist and the founding director of FAIR, who has championed openly published research at the lab since its founding in 2013.

The first version, LLaMA 1, was released in February 2023 under a research license. In March 2023, shortly after the release, the model's weights were leaked online via BitTorrent. Contrary to fears, the leak did not halt the project but accelerated its development, because it allowed independent researchers and enthusiasts worldwide to experiment with the model. As a result, tens of thousands of derivative models appeared on the Hugging Face platform. Subsequent versions, starting with LLaMA 2, were released under a license that permits commercial use^[1], cementing LLaMA's status as a key player in the open AI model market.

Model Evolution and Release Chronology

Chronology of LLaMA Model Development
Version	Release Date	Parameter Range	Key Innovations and Features
LLaMA 1	February 2023	7B–65B	Base architecture (RMSNorm, SwiGLU, RoPE). Trained on 1.4 trillion tokens. 2048-token context window. Research license.
LLaMA 2	July 2023	7B–70B	Chat variants fine-tuned for dialogue with RLHF. Grouped-Query Attention (GQA) introduced in the largest models. 4096-token context window. First license permitting commercial use.
Code Llama	August 2023	7B–70B	Specialized version for code. Fine-tuned on 500 billion tokens of code. Variants: base, Python-specialized, instruction-tuned. The 70B size was added in January 2024.
LLaMA 3	April 2024	8B, 70B	Trained on 15 trillion tokens. Improved tokenizer with a 128k-token vocabulary. High performance (82% on MMLU for the 70B model).
LLaMA 3.1	July 2024^[2]	8B, 70B, 405B	Flagship 405B model with performance on par with GPT-4o. Context window up to 128k tokens. Image processing capabilities explored experimentally in the accompanying paper but not included in the released models.
LLaMA 4	(planned: April 2025)	109B (Scout), 400B (Maverick), 2T (Behemoth)	Mixture-of-Experts (MoE) architecture. Native multimodality (text, images, video). Context window up to 10 million tokens.

Architecture

LLaMA uses an autoregressive decoder-only transformer architecture but introduces several key improvements that enhance computational efficiency and the quality of generated text:

Pre-normalization. Normalization is applied to the input of each transformer sub-layer rather than to its output. This stabilizes the training of very deep networks and mitigates vanishing and exploding gradients.
RMSNorm (Root Mean Square Layer Normalization). RMSNorm replaces the standard LayerNorm. It drops the mean-subtraction step and rescales activations by their root mean square only, which makes normalization cheaper while maintaining stability (the original RMSNorm paper reports running-time reductions of roughly 7–64%, depending on the model).
SwiGLU (Swish-Gated Linear Unit). The feed-forward layers use SwiGLU instead of a ReLU or GELU activation. One linear projection, passed through the Swish function, gates a second linear projection; this gives a smoother gradient flow and improves model quality.
RoPE (Rotary Position Embeddings). Token positions are encoded by rotating query and key vectors through position-dependent angles, so that attention scores depend on the relative distance between tokens. This helps the model extrapolate to sequences longer than those seen during training.
GQA (Grouped-Query Attention). Introduced in the larger LLaMA 2 models, this variant of multi-head attention lets groups of query heads share a single key/value head. It significantly reduces the memory needed for the key-value cache and accelerates text generation.
Mixture-of-Experts (MoE) (planned for LLaMA 4). An architecture that divides the model's parameters into "expert" sub-networks and activates only a small portion of them for each token. Compared with a dense model of the same total size, this substantially reduces the computational cost of inference.
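The components listed above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not Meta's implementation: the function names, the `eps` value, the RoPE base of 10000, the pairing of consecutive features for rotation, the toy tensor sizes, and the 80-layer, 128-dimension-per-head configuration used in the cache comparison are all choices made for the example.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square of the features (no mean subtraction)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward layer: (SiLU(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU (Swish): g * sigmoid(g)
    return (silu * (x @ w_up)) @ w_down


def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sub-layer input, then add the residual."""
    return x + swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)


def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x, shape (seq_len, dim), by position-dependent angles."""
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per feature pair
    angles = np.outer(np.arange(seq_len), inv_freq)    # shape (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for the cached keys and values of one sequence (factor 2 = keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, 8 features (toy sizes)
w_gate, w_up = rng.normal(size=(2, 8, 16))   # feed-forward width 16
w_down = rng.normal(size=(16, 8))
y = pre_norm_ffn_block(x, np.ones(8), w_gate, w_up, w_down)
print(y.shape)                               # (4, 8)

# Grouped-query attention: 8 shared key/value heads instead of 64.
full = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
print(full // grouped)                       # 8
```

Because RoPE is a pure rotation, it leaves vector lengths unchanged, and the dot product between a rotated query and a rotated key depends only on the distance between their positions. The cache comparison shows why sharing key/value heads matters for long contexts: the cache shrinks in proportion to the number of key/value heads.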

LLaMA 1 Configurations

Architectural Parameters of LLaMA 1 Models
Model	Parameters	Hidden State Dimension	Number of Layers	Number of Attention Heads	Training Data Volume
7B	6.7B	4096	32	32	1.0T tokens
13B	13.0B	5120	40	40	1.0T tokens
33B	32.5B	6656	60	52	1.4T tokens
65B	65.2B	8192	80	64	1.4T tokens
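The parameter counts in the table can be roughly reproduced from the hidden dimension and the number of layers. The sketch below relies on details that are not stated in this article and should be read as assumptions: a 32,000-token vocabulary, separate input and output embedding matrices, no bias terms, and a SwiGLU feed-forward width equal to two thirds of four times the hidden dimension, rounded up to a multiple of 256. The function name is hypothetical.

```python
def estimate_llama1_params(hidden_dim, n_layers, vocab_size=32000, multiple_of=256):
    """Rough parameter count for a LLaMA 1 style decoder (see the stated assumptions)."""
    # SwiGLU width: 2/3 of 4 * hidden_dim, rounded up to a multiple of `multiple_of`.
    ffn_dim = int(2 * (4 * hidden_dim) / 3)
    ffn_dim = multiple_of * ((ffn_dim + multiple_of - 1) // multiple_of)

    attention = 4 * hidden_dim * hidden_dim    # query, key, value, output projections
    feed_forward = 3 * hidden_dim * ffn_dim    # gate, up, down projections
    norms = 2 * hidden_dim                     # two RMSNorm weight vectors per layer
    per_layer = attention + feed_forward + norms

    embeddings = 2 * vocab_size * hidden_dim   # input embedding + output projection
    final_norm = hidden_dim
    return n_layers * per_layer + embeddings + final_norm


# (hidden state dimension, number of layers) taken from the table
configs = {"7B": (4096, 32), "13B": (5120, 40), "33B": (6656, 60), "65B": (8192, 80)}
for name, (hidden_dim, n_layers) in configs.items():
    total = estimate_llama1_params(hidden_dim, n_layers)
    print(f"{name}: {total / 1e9:.2f}B parameters")
```

Under these assumptions the estimates land within about 1% of the values in the table. The number of attention heads does not appear in the formula, because the heads only partition the hidden dimension and do not change the size of the projection matrices.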

Training Data

The volume of the training corpora grew from 1.4 trillion tokens for LLaMA 1 to 15 trillion for LLaMA 3. LLaMA 1 was trained on publicly available sources: Common Crawl (about 67% of the data), C4, GitHub, Wikipedia, Books, ArXiv, and Stack Exchange. For LLaMA 3, high-quality private data was also used.

Performance and Comparison

On benchmarks: The LLaMA 3.1 (405B) model shows results close to GPT-4o. On the MMLU test it achieves 88.6%, trailing GPT-4o by only 0.1 percentage points. On the HumanEval code generation task it scores 89%, compared with 90.2% for GPT-4o.
Parameter efficiency: LLaMA models with fewer parameters often outperform larger competitor models. For example, LLaMA 1 (13B) surpassed GPT-3 (175B) on most tests.
Cost: When hosted locally, the inference cost of LLaMA can be up to 50 times lower compared to using proprietary APIs, making the technology accessible to small and medium-sized businesses.

Licensing

LLaMA 1 was distributed under a non-commercial research license with access available upon request.
LLaMA 2 and later versions are distributed under the Llama Community License, which permits commercial use and modification. However, the license contains restrictions: companies with more than 700 million monthly active users must obtain special permission from Meta. This has sparked debate about whether LLaMA is a fully open model.

Applications

LLaMA models are integrated into the products of thousands of companies and are used in various fields:

Corporate sector: Zoom uses LLaMA in its AI Companion for meeting summaries; Shopify uses it to process 40–60 million daily requests to enrich product metadata; Instacart uses it in its internal assistant, Ava.
Science and society: Meditron (an adaptation of LLaMA) is used for medical diagnosis in resource-limited regions.
Government and industry: Meta has formed partnerships with Lockheed Martin and Palantir. NASA uses LLaMA 3 on the ISS as an offline assistant to perform critical operations without communication with Earth.

Limitations and Criticism

Bias and safety: Independent audits show that despite safety measures, LLaMA models can reproduce harmful stereotypes. The leak of LLaMA 1's weights heightened concerns about the potential malicious use of the technology.
Knowledge gaps: In highly specialized domains, LLaMA can exhibit knowledge gaps. For example, its accuracy on the nephSAP medical test was 17–30%, compared to 73% for GPT-4.
Energy consumption: Training large models requires enormous resources. The training of LLaMA 1 required 2,638 MWh, equivalent to the emission of 1,015 tons of CO₂.
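The emissions figure in the last item follows from multiplying energy use by a carbon-intensity factor. A minimal check of the arithmetic, assuming a factor of 0.385 kg CO₂-equivalent per kWh (the US average used in the LLaMA 1 paper; the factor is not stated in this article):

```python
ENERGY_MWH = 2638            # training energy cited for LLaMA 1
KG_CO2EQ_PER_KWH = 0.385     # assumed carbon intensity of the electricity used

energy_kwh = ENERGY_MWH * 1000
tons_co2eq = energy_kwh * KG_CO2EQ_PER_KWH / 1000   # kilograms -> metric tons
print(f"{tons_co2eq:.1f} t CO2eq")
```

The result agrees with the cited figure of about 1,015 tons to within one ton. A different electricity mix would change the emissions proportionally without changing the energy figure.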

Future

Meta plans to invest up to $65 billion in AI infrastructure in 2025. The LLaMA 4 Behemoth model, with 2 trillion parameters, is under development. It is expected to support over 200 languages and to feature deep integration with metaverse products.

Literature

Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. arXiv:2305.13245.
Fedus, W.; Zoph, B.; Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
Grattafiori, A. et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
Jiang, Z. et al. (2023). Pre‑RMSNorm and Pre‑CRMSNorm Transformers: Equivalent and Efficient Pre‑LN Transformers. arXiv:2305.14858.
Rozière, B. et al. (2023). Code Llama: Open Foundation Models for Code. arXiv:2308.12950.
Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202.
Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine‑Tuned Chat Models. arXiv:2307.09288.
Zhang, B.; Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv:1910.07467.

Notes

[1] The LLaMA license does not meet all criteria for open-source software, as it imposes restrictions on commercial use by the largest companies and requires disclosure of modifications.
[2] LLaMA 3.1 was announced and released in July 2024. See the official Meta announcement.
