Theoretical foundations of large language models

From Systems Analysis Wiki

The theoretical foundations of large language models (based on the Transformer architecture) are the set of mathematical, statistical, and information-theoretic principles that underpin the functioning, training, and capabilities of modern large language models (LLMs). These foundations explain how models built on the Transformer architecture are able to understand and generate human language with a high degree of coherence.

Architectural Foundations: The Transformer Architecture

Modern LLMs are almost entirely based on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need." This architecture abandoned recurrent layers (as in RNNs and LSTMs), relying instead on the attention mechanism, which enabled efficient processing of long sequences and parallelization of computations.

Self-Attention Mechanism

This is the core of the Transformer architecture. The self-attention mechanism allows the model to weigh the importance of each word (token) in a sequence relative to all other words in the same sequence. For each token, three vectors are created:

  • Query (Q): a vector representing the current word.
  • Key (K): a vector against which queries from other words are compared.
  • Value (V): a vector containing the information about the word to be passed on.

The attention score is calculated as a scaled dot-product:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimensionality of the key vectors. This mechanism allows the model to capture complex contextual dependencies regardless of the distance between words.

Multi-Head Attention is the parallel execution of several such computations with different projection matrices, allowing the model to simultaneously focus on different aspects of syntax and semantics.
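The scaled dot-product formula above can be sketched in a few lines of NumPy. This is an illustrative single-head implementation written for this article, not code from any particular framework:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row is a distribution over tokens
    return weights @ V, weights

# Tiny example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention repeats this computation with several independent learned projections of Q, K, and V, then concatenates the results.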

Types of Transformer-Based Architectures

There are three main ways to use the Transformer components:

  1. Encoder-Decoder: The classic architecture for sequence-to-sequence tasks (e.g., machine translation). The encoder processes the input sequence, and the decoder generates the output sequence. Examples: T5, BART.
  2. Encoder-Only: Models that use only the encoder stack. They are well-suited for tasks requiring a deep understanding of the context of the entire sequence (text classification, named entity recognition). Example: BERT.
  3. Decoder-Only: Models that use only the decoder stack. They operate autoregressively, predicting the next token based on the preceding ones. This is the standard for generative models. Examples: GPT, LLaMA, Claude.

Positional Encoding

Since the self-attention mechanism does not account for word order, positional encoding is added to the architecture. Vectors encoding the position of tokens in the sequence are added to their embeddings. The original model used sinusoidal functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Modern models also use learned and rotary (Rotary Position Embeddings, RoPE) positional encodings.
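The sinusoidal scheme above can be sketched directly from the two formulas. This is an illustrative NumPy implementation, not the code of any specific model:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)               # even dimensions
    pe[:, 1::2] = np.cos(angle)               # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
```

Each position receives a unique pattern of frequencies, and relative offsets correspond to phase shifts, which is what lets the attention layers recover ordering information.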

Training Principles: From Probability to Optimization

Language Modeling as a Probabilistic Task

At the core of an LLM is the task of language modeling—predicting the probability of a text sequence. Formally, for a sequence X = (x_1, x_2, ..., x_T), the model estimates the probability P(X). Using the chain rule of probability, this is decomposed into a product of conditional probabilities:

P(X) = ∏_{t=1}^{T} P(x_t | x_1, ..., x_{t-1})

Thus, training the model reduces to predicting the next token x_t from the context of the preceding tokens.
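The chain-rule decomposition can be illustrated with a toy bigram model, where each conditional depends only on the previous token. The probability table below is invented purely for illustration; a real LLM computes each conditional with a Transformer over the full context:

```python
import math

# Hypothetical bigram conditionals P(x_t | x_{t-1}); "<s>" marks sequence start.
bigram = {
    ("<s>", "the"): 0.5, ("<s>", "a"): 0.5,
    ("the", "cat"): 0.4, ("the", "dog"): 0.6,
    ("a", "cat"): 0.3,   ("a", "dog"): 0.7,
}

def sequence_log_prob(tokens):
    # log P(X) = sum over t of log P(x_t | x_{t-1}) under the bigram approximation.
    lp = 0.0
    prev = "<s>"
    for tok in tokens:
        lp += math.log(bigram[(prev, tok)])
        prev = tok
    return lp

lp = sequence_log_prob(["the", "cat"])  # log(0.5) + log(0.4) = log(0.2)
```

Working in log space avoids numerical underflow, since the product of many probabilities below 1 quickly becomes vanishingly small.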

Loss Function and Information Theory

To evaluate the quality of predictions and train the model, the cross-entropy loss function is used. It measures the divergence between the probability distribution predicted by the model (q) and the true distribution (p), where the correct next token has a probability of 1 and all others have a probability of 0.

H(p, q) = - Σ_i p(i) log q(i)

Minimizing cross-entropy is equivalent to maximizing the likelihood of the training data.

A related evaluation metric is perplexity, defined as the exponential of the cross-entropy: Perplexity = 2^H(p,q) when the cross-entropy is measured in bits. Intuitively, perplexity is the average number of tokens the model is effectively "choosing" between at each step; the lower the perplexity, the more confident and accurate the model.
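As a small worked sketch of these definitions: with a one-hot true distribution, the cross-entropy reduces to the negative log-probability of the correct token, and exponentiating the average loss gives perplexity (base 2 here, so losses are in bits):

```python
import math

def cross_entropy(true_index, q):
    # With a one-hot true distribution p, H(p, q) = -log2 q(correct token).
    return -math.log2(q[true_index])

def perplexity(losses):
    # Perplexity = 2 ** (average cross-entropy in bits).
    return 2 ** (sum(losses) / len(losses))

# A model that assigns 0.25 to the correct token at each of two steps
# (i.e., a uniform guess over 4 tokens).
losses = [cross_entropy(0, [0.25, 0.25, 0.25, 0.25]) for _ in range(2)]
ppl = perplexity(losses)  # a uniform 4-way guess gives perplexity 4
```

This matches the intuition in the text: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 tokens.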

Optimization

Training an LLM is a process of minimizing the loss function by adjusting the model's billions of parameters. This is done using methods based on gradient descent. The most common optimizer is Adam (Adaptive Moment Estimation) and its variants (e.g., AdamW), which adaptively adjust the learning rate for each parameter.
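A minimal sketch of a single-parameter Adam update, following the standard update rules; the hyperparameter defaults are the commonly used ones, and the toy objective is chosen purely for illustration:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update for a single scalar parameter.
    m = b1 * m + (1 - b1) * grad       # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2  # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero-initialized moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy loss L(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, grad=2 * theta, m=m, v=v, t=t, lr=0.1)
```

The per-parameter scaling by sqrt(v_hat) is what distinguishes Adam from plain gradient descent; AdamW additionally decouples weight decay from this adaptive step.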

Training Paradigms

  1. Pre-training: The model is trained on vast, unlabeled text corpora (e.g., Common Crawl, The Pile, C4) using self-supervised tasks, such as:
    • Causal Language Modeling (CLM): Predicting the next token (used in GPT).
    • Masked Language Modeling (MLM): Reconstructing randomly masked tokens in the text (used in BERT).
  2. Fine-tuning: After pre-training, the model is adapted to specific tasks on smaller, labeled datasets.
  3. Alignment: A special fine-tuning stage aimed at aligning the model's behavior with human preferences and values. A key method is RLHF (Reinforcement Learning from Human Feedback), where the model is fine-tuned using a reward signal from a model that predicts human preferences.

Scaling Laws and Emergent Abilities

Empirical studies have shown that LLM performance predictably improves with an increase in three factors: model size (number of parameters, N), training dataset size (D), and computational budget (C). This relationship is described by power laws (scaling laws).

The law proposed in the OpenAI paper (Kaplan et al., 2020) shows that the loss function L decreases as a power function of N, D, and C. A later paper by DeepMind (Hoffmann et al., 2022) refined these laws (the Chinchilla scaling laws), demonstrating that for optimal training, both model size and data volume must be increased in a balanced way.

An important consequence of scaling is the appearance of emergent abilities—qualitative leaps in performance where the model begins to solve tasks it was not explicitly trained on (e.g., arithmetic, logical reasoning, code writing). These abilities are typically absent in smaller models and only manifest after a certain scale threshold is reached.
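The Chinchilla functional form L(N, D) = E + A/N^α + B/D^β can be sketched as follows. The constants are the approximate fitted values reported by Hoffmann et al. (2022) and should be treated as illustrative rather than authoritative:

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    # L(N, D) = E + A / N^alpha + B / D^beta
    # E: irreducible loss; the other terms shrink as parameters (N) and
    # training tokens (D) grow. Constants are approximate fitted values.
    return E + A / N ** alpha + B / D ** beta

loss_small = chinchilla_loss(N=1e9, D=2e10)    # ~1B params, ~20B tokens
loss_large = chinchilla_loss(N=7e10, D=1.4e12) # ~70B params, ~1.4T tokens
```

Because both correction terms decay as power laws, doubling only N or only D yields diminishing returns; the Chinchilla result is that compute is best spent growing both together.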

Text Generation: Decoding Strategies

After training, the model generates text by iteratively predicting the next token. The choice of the next token from the probability distribution output by the model is made using various decoding strategies:

  • Greedy Search: Always selects the most probable token. It is fast but often leads to repetitive and dull text.
  • Beam Search: At each step, the k most likely partial sequences are kept, which can find sequences with higher overall probability than greedy search.
  • Sampling with Temperature: Token probabilities are adjusted by a temperature parameter (T). For T>1, the distribution becomes more uniform (more creativity), while for T<1, it becomes more peaked (less randomness).
  • Top-k Sampling: At each step, sampling is restricted to the k most likely tokens.
  • Top-p (Nucleus) Sampling: Sampling is restricted to the smallest set of tokens whose cumulative probability exceeds a threshold p. This allows for dynamic adaptation of the candidate pool size.
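The sampling strategies above can be combined in a single sketch. This is an illustrative pure-Python implementation written for this article, not the API of any particular library:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    # Temperature scaling: T > 1 flattens the distribution, T < 1 sharpens it.
    scaled = [l / temperature for l in logits]
    mx = max(scaled)
    probs = [math.exp(l - mx) for l in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]
    # Rank token indices by probability, then truncate the candidate pool.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]          # keep only the k most likely tokens
    if top_p is not None:
        cum, nucleus = 0.0, []
        for i in ranked:                 # smallest prefix whose mass reaches p
            nucleus.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        ranked = nucleus
    # Renormalize over the pool and draw one token from its distribution.
    pool = [probs[i] for i in ranked]
    z = sum(pool)
    r, acc = rng.random() * z, 0.0
    for i, p in zip(ranked, pool):
        acc += p
        if acc >= r:
            return i
    return ranked[-1]
```

With top_k=1 this reduces to greedy search; with top_p < 1 the pool size adapts to how peaked the distribution is at each step.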

Theoretical Problems and Limitations

  • Hallucinations: The tendency of models to generate factually incorrect but plausible-sounding information. This is because models are optimized for text probability, not truthfulness.
  • Bias: LLMs inherit and amplify social, cultural, and other biases present in their training data.
  • Interpretability ("Black Box" Problem): Due to the vast number of parameters, it is extremely difficult to understand exactly how the model makes its decisions, which complicates debugging and creates risks.
  • Computational Complexity: The self-attention mechanism has quadratic complexity in sequence length, O(n^2), which limits the maximum context length that can be processed.

See Also

  • Large language models
  • BERT
  • GPT

Literature

  • Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
  • Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
  • Brown, T. B. et al. (2020). Language Models Are Few-Shot Learners. arXiv:2005.14165.
  • Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
  • Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  • Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155.
  • Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
  • Bubeck, S. et al. (2023). Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv:2303.12712.
  • Touvron, H. et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
  • Bender, E. M. et al. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. DOI:10.1145/3442188.3445922.