Large language model architectures

Large Language Model (LLM) architectures are the fundamental principles and structures that define how large language models are built, trained, and operated. Modern LLMs, capable of understanding and generating human language, are almost entirely based on the Transformer architecture^[1], but incorporate numerous enhancements and different approaches aimed at improving efficiency, scalability, and capabilities.

Families of LLM Architectures (Transformers)

Modern large language models are based on the Transformer architecture^[1], but they utilize it differently depending on the objective: to understand text, generate a continuation, or transform one text into another. In practice, three main families are distinguished, all while preserving the core principles of the Transformer^[2]^[3]^[4].

1. Encoder-only

The model uses only a stack of encoders and processes the entire input text bidirectionally. Pre-training is typically based on masked-language modeling (MLM): some tokens are masked, and the model learns to reconstruct them from their context. Thanks to this bidirectional context, such models excel at understanding and scoring tasks, such as classification, named entity recognition, document re-ranking, and extractive QA. They are not designed for autoregressive generation from scratch.

Additionally, alternative pre-training objectives are used for the encoder-only family in practice: replaced token detection (RTD) in ELECTRA (where a discriminator model identifies replaced tokens) and contrastive learning for bi-encoders in semantic search/retrieval (InfoNCE/softmax loss on "query-document" pairs, as in Dense Passage Retrieval). When used in RAG, encoder-only models act either as a bi-encoder (separate encoding of query and document for fast ANN search) or as a cross-encoder (joint encoding of the pair for precise re-ranking).
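
The contrastive objective for bi-encoders described above can be illustrated with a minimal sketch. It assumes in-batch negatives (the positive for query i is the document at the same index, and every other document in the batch acts as a negative), dot-product similarity, and a temperature parameter. The function name `info_nce_loss` and the toy vectors are illustrative and are not taken from any library.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(queries, docs, temperature=1.0):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] and docs[i] form the positive pair; docs[j] with j != i
    serve as negatives for query i.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(queries)


# Toy embeddings (hypothetical): aligned pairs should score a lower loss.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs_aligned = [[2.0, 0.0], [0.0, 2.0]]
docs_shuffled = [[0.0, 2.0], [2.0, 0.0]]

loss_aligned = info_nce_loss(queries, docs_aligned)
loss_shuffled = info_nce_loss(queries, docs_shuffled)
print(loss_aligned < loss_shuffled)
```

A cross-encoder would instead score each query-document pair jointly in one forward pass, which is why it is usually reserved for re-ranking a short candidate list.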

Advantages:

High-quality text understanding due to bidirectional context: classification, NER, fact extraction, re-ranking, extractive QA.
Parallel processing and high throughput: a single forward pass without auto-regression; convenient for batching mass scoring tasks.
Natural integration with search and RAG: as a bi-encoder for fast semantic search; as a cross-encoder for precise re-ranking.
Effective adaptation: relatively compact variants (≈100–300M parameters; BERT-base ≈110M) achieve high quality after task-specific fine-tuning.
Stable latency, independent of the generated response length (no step-by-step decoding); well-suited for offline scoring of large collections.
Ability to extend the context window using relative/rotary position embeddings and/or local-sparse attention (e.g., Longformer/BigBird), which is useful for long documents.

Disadvantages:

No native generative capabilities: requires a decoder or an external generative module for dialogues and detailed answers.
Limited in interactive scenarios: no step-by-step generation with state preservation.
Mismatch between pre-training objective and free-form generation tasks: MLM is less aligned with generation compared to causal language modeling.
Historically limited context window (often 512 tokens in base configurations with absolute positions); extension requires special position/attention schemes and/or fine-tuning.
For retrieval tasks, separate contrastive fine-tuning of a bi-encoder and/or cross-encoder is required; without it, search/re-ranking quality is typically lower than that of specially trained models.

Representative models: BERT and its derivatives, as well as RoBERTa and DeBERTa (advanced encoder-only variants); from alternative pre-training objectives, ELECTRA (RTD). ^[5]^[6]^[7]^[8]^[9]^[10]

2. Decoder-only

This architecture uses only a stack of decoder blocks with causal (left-to-right) attention: the model predicts the next token based on a given prefix. This training mode, causal language modeling (CLM), makes these models a natural choice for generation tasks: dialogues, detailed answers, creative text, and code. The trade-off is that latency and KV cache size grow with prompt length. In practice, various engineering techniques are widely used for decoder-only models: reducing the KV cache with MQA and GQA, accelerating inference with speculative decoding, and server-side optimizations (PagedAttention/vLLM, continuous batching, chunked prefill).^[11]^[12]^[13]^[14]^[15]

Advantages:

Natural text generation (CLM): strong zero-shot and few-shot capabilities; scales well.^[16]
Versatility: a single model can solve many tasks through instructions and examples in the prompt; naturally combines with RAG and tool use.
Mature ecosystem: established practices for instruction fine-tuning and alignment (RLHF, DPO); open-source and commercial implementations are available.^[17]^[18]
Rich stack of inference optimizations: MQA/GQA reduce KV cache size and increase throughput; speculative decoding accelerates inference without changing the output distribution; PagedAttention/vLLM with continuous batching and chunked prefill improve end-to-end GPU utilization.^[19]^[20]^[21]^[22]^[23]
Support for structured generation for strict output formats (JSON/SQL/DSL), which simplifies integration with information systems and APIs.^[24]^[25]

Disadvantages:

Increased generation latency: sequential output; the cost of a new token grows with the length of the already processed context (KV cache).
Less efficient for "long input–short output" profiles (summarization, translation) compared to encoder-decoder models, where the input is encoded only once.
Limited by unidirectional context: in understanding tasks, it can sometimes be inferior to models with bidirectional representations (encoder-only / encoder-decoder).
Memory for the KV cache can be a bottleneck with long prompts and large batches. ^[26]
Quantization of activations/KV (INT8/FP8) speeds up inference but can degrade quality on long contexts/code; requires careful validation (especially with strict SLAs).
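
To make the KV cache bottleneck concrete, the following back-of-the-envelope sketch estimates cache size for multi-head, grouped-query, and multi-query attention. The layer and head counts follow a LLaMA-65B-like shape (80 layers, 64 heads, hidden size 8192); the prompt length of 4,096 tokens, the 8-group GQA setting, and FP16 storage are assumptions chosen for illustration, not a description of any released model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values across all layers.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


LAYERS, HEADS = 80, 64
HEAD_DIM = 8192 // HEADS  # 128
SEQ_LEN = 4096            # assumed prompt length

mha = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN)  # one K/V head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)      # assumed 8 K/V groups
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)      # single shared K/V head

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.2f} GiB per sequence")
```

The cache scales linearly with both sequence length and batch size, so the savings from sharing K/V heads translate directly into larger feasible batches.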

Representative models: GPT-3, GPT-4 (architecture and dataset details are not publicly disclosed), LLaMA and Llama 3 (8B/70B, 2024).^[27]^[28]^[29]^[30]

3. Encoder-decoder

This architecture combines both components: the encoder operates bidirectionally, while the decoder operates causally. The encoder analyzes the input once to form its representation; the decoder then generates the output by attending to this representation via cross-attention. This separation is particularly useful for tasks that transform a long input text into a short output, such as machine translation, summarization, and document-based question answering. Two stacks plus cross-attention require more overall computational resources, but generation is conditioned on a complete analysis of the source text, and the encoding is performed only once and reused throughout inference.

Advantages:

Conditional generation: the decoder uses cross-attention to the input representation. ^[31]
Efficient in "long input → short output" scenarios: the input is encoded only once.
Convenient for the "text-to-text" format and controlled output (task-specific prefixes, special instructions). ^[32]
Stability and efficiency with long source texts: during the decoding phase, only the self-attention over the output grows, while cross-attention reuses fixed keys/values from the encoder (the input is not "re-read" at each step).

Disadvantages:

Two stacks increase memory and computational requirements for training and deployment.
With very long outputs, the total latency is comparable to that of decoder-only models, because autoregressive decoding remains the bottleneck.
Fewer universal chat models compared to the decoder-only family; more often used as a high-quality seq2seq engine for specific tasks.
With very long inputs, the memory for cross-attention keys/values in each decoder layer increases (across the entire source), requiring careful serving planning.

Representative models: T5 (including T5 v1.1 and the instruction fine-tuning practice in FLAN-T5) and BART. ^[33]^[34]^[35]

Dense Transformers

Dense Transformers are the classic and most common LLM architecture: nearly the entire set of model parameters is engaged in processing each token. Unlike sparse approaches (e.g., Mixture-of-Experts), there is no selective activation of subnetworks; every block operates on every token. ^[1]

Principle of Operation and Architecture

Basic Structure. The model is a stack of N identical Transformer blocks. Each block includes:

Multi-Head Self-Attention. For each token, three vectors are computed: Q (query), K (key), and V (value); attention is defined as $\mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}} + M\right) V$, where $M$ is an additive mask (causal and/or padding) that is 0 at allowed positions and $-\infty$ at excluded ones. Multiple attention "heads" run in parallel to capture different aspects of the context ($H$ heads, usually with $d_{head} = \frac{d_{model}}{H}$); their number grows with the model's scale. ^[1]
Feed-Forward Network (FFN). Two linear layers with a non-linearity between them (ReLU in the original Transformer, typically GELU/SiLU in later models); several modern models use the gated SwiGLU variant, which adds a third projection. The intermediate dimensionality is typically $\approx 4 d_{model}$; with SwiGLU, it is often set to $\approx \frac{8}{3} d_{model}$ so that the three matrices hold roughly as many parameters as the two matrices of the standard FFN. The FFN contains a significant portion of the parameters. ^[1]^[36]
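
The attention formula above can be traced in a few lines of plain Python. This is a minimal single-head sketch with a causal mask, written for readability rather than speed; the function names and toy matrices are illustrative assumptions, and production implementations use batched tensor kernels instead.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k) + M) V with a causal M."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            mask = 0.0 if j <= i else float("-inf")  # M[i][j]: block future positions
            scores.append(score + mask)
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


# Toy inputs (hypothetical): three tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(Q, K, V)
print(out[0])  # the first token can only attend to itself
```

Because the mask removes all future positions, the first output row is exactly the first value vector; later rows are convex combinations of the value vectors seen so far.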

Additional Components. Residual connections and layer normalization are used; modern LLMs more often apply Pre-LN (normalization before sub-blocks), which improves training stability at greater depths. In addition to classic LayerNorm, RMSNorm is increasingly used (it reduces computational overhead and performs well in large models); some families also apply normalization in the attention space (e.g., normalizing Q/K before softmax). Positional representations can be absolute or relative; for long contexts, RoPE has become the de facto standard.

Examples of Models and Scale

BERT-Large: 24 layers, 1024 hidden size, 16 attention heads, ≈340M parameters. ^[37]
GPT-3 (175B): 96 layers, 12288 hidden size, 96 attention heads, ≈175B parameters. ^[38]
LLaMA-65B: 80 layers, 8192 hidden size, 64 attention heads, ≈65B parameters. ^[39]
PaLM-540B: 118 layers, hidden size of around 18432, ≈540B parameters. ^[40]
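
The listed sizes can be sanity-checked with a common rule of thumb: each block holds about $4 d_{model}^{2}$ attention parameters (Q, K, V, and output projections) plus about $8 d_{model}^{2}$ FFN parameters, giving roughly $12 \cdot N \cdot d_{model}^{2}$ in total. The sketch below applies this estimate; it is an approximation that ignores embeddings, biases, and normalization weights, and the helper name is illustrative.

```python
def approx_block_params(n_layers, d_model):
    """Rough dense-Transformer parameter count, excluding embeddings.

    Per block: 4 * d_model^2 for attention, and 8 * d_model^2 for the FFN
    (two matrices at 4x width, or three SwiGLU matrices at 8/3x width).
    """
    return 12 * n_layers * d_model ** 2


configs = [
    ("BERT-Large", 24, 1024),
    ("GPT-3 (175B)", 96, 12288),
    ("LLaMA-65B", 80, 8192),
]
for name, layers, d_model in configs:
    print(f"{name}: ~{approx_block_params(layers, d_model) / 1e9:.1f}B")
```

Under these assumptions the estimates come out at roughly 0.3B, 174B, and 64B, close to the reported totals; for the smaller model, embedding tables account for much of the remaining gap. PaLM is left out because its block deviates from this layout (for example, multi-query attention), so the rule is a poorer fit there.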

Advantages

Uniform blocks, well-studied training regimes, and predictable scaling behavior.
Quality improves as a power law with the growth of parameters and data; the compute-optimal regime involves increasing both model size and the volume of training tokens. ^[41]^[42]
The same architecture can cover a wide range of tasks after fine-tuning, without changes at the layer level.

Disadvantages

Full self-attention has quadratic complexity with respect to sequence length ($O(n^{2})$), which limits the context window. ^[1]
Full parameter activation at each generation step: in a non-MoE decoder, the inference cost per token grows approximately proportionally to the number of parameters.
Bottlenecked by memory bandwidth (memory-bound): loading weights from HBM often limits inference speed.

Scaling and Context Limitations

Memory for parameters grows linearly with model size; training memory increases due to gradients and optimizer states.
Base configurations were historically limited to 2,000–4,000 tokens. Modern positional schemes (RoPE) and extension techniques (Position Interpolation, YaRN, etc.) can increase the window by an order of magnitude or more, but at the cost of additional computational/memory load. ^[43]^[44]
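
The gap between inference and training memory can be estimated with simple arithmetic. The sketch below assumes FP16 weights for inference and the mixed-precision Adam accounting commonly used in the ZeRO literature (16 bytes per parameter); it ignores activations, KV cache, and framework overhead, and the function names are illustrative.

```python
def inference_weight_gib(n_params, bytes_per_param=2):
    """Memory for the weights alone (FP16 by default)."""
    return n_params * bytes_per_param / 2**30


def training_state_gib(n_params):
    """Assumed mixed-precision Adam accounting per parameter:
    FP16 weights (2) + FP16 gradients (2)
    + FP32 master weights, momentum, and variance (4 + 4 + 4) = 16 bytes.
    """
    return n_params * 16 / 2**30


n_params = 65_000_000_000  # a 65B-parameter model
print(f"inference weights: {inference_weight_gib(n_params):.0f} GiB")
print(f"training state:    {training_state_gib(n_params):.0f} GiB")
```

Under this accounting the training state is eight times the FP16 weight footprint, which is why parameters, gradients, and optimizer states are usually sharded across devices.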

Modern Optimizations

FlashAttention. An exact attention implementation that is aware of the GPU memory hierarchy; it reduces memory costs and accelerates training/inference for long sequences. ^[45]
KV Cache Reduction and Management. Multi-Query Attention and Grouped-Query Attention reduce cache size and memory traffic; at the server level, PagedAttention (vLLM) increases throughput via page-based cache management. ^[46]^[47]^[48]
Speculative Decoding. A smaller draft model proposes several tokens ahead, and the main model verifies them in a single parallel forward pass, accepting or resampling each one; this achieves acceleration without changing the output distribution. ^[49]
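
Why speculative decoding leaves the output distribution unchanged can be seen from its acceptance rule. The sketch below handles a single token with explicit toy distributions `p` (target model) and `q` (draft model), both assumed for illustration; real systems verify several draft tokens per target forward pass and work on model logits.

```python
import random


def speculative_step(p, q, rng):
    """Return one token distributed according to target p, proposed from draft q."""
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]        # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):     # accept with probability min(1, p/q)
        return x
    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0]


def exact_output_distribution(p, q):
    """Analytic distribution of speculative_step's return value."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:  # p == q: every proposal is accepted
        return accept
    return [a + reject_mass * r / z for a, r in zip(accept, residual)]


p = [0.6, 0.3, 0.1]  # toy target distribution (assumed)
q = [0.2, 0.5, 0.3]  # toy draft distribution (assumed)
print(exact_output_distribution(p, q))
```

The analytic result matches `p` up to floating-point rounding: tokens the draft over-proposes are rejected just often enough, and the rejected mass is redistributed to tokens the draft under-proposes.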

Sparse Models and Mixture-of-Experts (MoE)

MoE is a method for increasing a model's capacity without a proportional increase in computation per token. Instead of a single large FFN block, each layer uses a set of parallel "experts" (multiple independent FFNs), and a trainable router (gating network) selects the top-k most relevant experts for each token (typically k=1–2; in some models, k=4). Only the selected experts are activated; their outputs are weighted and summed. This allows the total number of parameters to reach hundreds of billions or even trillions, while only a small fraction is engaged at each step. ^[50]^[51]
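
The routing step described above can be sketched for a single token as follows. The toy experts (simple scaling functions), the router logits, and the choice to renormalize gate weights over the selected experts are assumptions made for illustration; specific models differ in how they normalize gates and handle expert capacity.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    out = [0.0] * len(token)
    for i in top:
        y = experts[i](token)          # only the selected experts are evaluated
        w = probs[i] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top


# Four hypothetical experts that scale the input by 1, 2, 3, and 4.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.0, 2.0, 1.0, -1.0]  # in a real layer: a learned projection of the token
out, chosen = moe_forward([1.0, -1.0], router_logits, experts, k=2)
print(chosen, out)
```

Only two of the four expert functions run for this token, which is the source of the compute savings; all four still have to be resident somewhere in memory.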

Examples of Models and Scale

Switch Transformer (Google): up to ~1.6T parameters; top-1 routing (one expert per token). Showed that MoE allows for a dramatic increase in capacity at comparable per-token costs. ^[50]
GLaM (Google): 1.2T parameters, 64 experts per layer, top-2; ≈96.6B parameters (≈8%) are activated for each token. ^[51]
Mixtral 8×7B (Mistral AI): ~46.7B total parameters, ≈12.9B active per token, top-2. ^[52]^[53]
Mixtral 8×22B: ~141B total parameters, ≈39B active per token, top-2. ^[54]
DBRX (Databricks): 132B total parameters, ≈36B active per token; 16 experts and top-4 routing (fine-grained MoE). ^[55]

Advantages

Per-token compute is determined by the number of active experts (k), not the total number of parameters: trillion-scale models can be trained and served at a FLOP cost comparable to much smaller dense models, although all experts must still be stored. ^[51]
Specialization: experts automatically "adapt" to languages/domains/patterns, improving quality in multi-domain tasks.
Flexible deployment: frequently used experts can be kept in memory, while rare ones are loaded on demand (with appropriate infrastructure).

Limitations

Load balancing: without regularization, the router can "stick" to a subset of experts (router collapse). Auxiliary losses (load-balancing) and improved routing schemes are needed. ^[50]
Complexity of distributed computing: requires expert parallelism and all-to-all communication; communication overhead and memory management become bottlenecks. ^[56]
Training stability: router settings and capacity limits are crucial; otherwise, quality/convergence degradation is possible.
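
The auxiliary load-balancing loss mentioned above is often written, following the Switch Transformer formulation, as $\alpha \cdot N \cdot \sum_{i} f_{i} P_{i}$, where $N$ is the number of experts, $f_{i}$ is the fraction of tokens dispatched to expert $i$, and $P_{i}$ is the mean router probability for expert $i$. The sketch below assumes top-1 routing and a coefficient of 0.01; the function name and toy inputs are illustrative.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs: one probability vector over experts per token."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [0.0] * n_experts  # fraction of tokens dispatched to each expert
    P = [0.0] * n_experts  # mean router probability per expert
    for probs in router_probs:
        chosen = max(range(n_experts), key=lambda i: probs[i])  # top-1 assignment
        f[chosen] += 1.0 / n_tokens
        for i, p in enumerate(probs):
            P[i] += p / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))


balanced = load_balancing_loss([[0.6, 0.4], [0.4, 0.6]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced, collapsed)
```

With perfectly uniform routing the sum equals $1/N$, so the loss reduces to $\alpha$; a router that sends every token to one expert is penalized more heavily.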

Modern Improvements

Expert-Choice routing: experts "choose" tokens, which improves load balancing and convergence at comparable costs. ^[57]
Fine-grained MoE: a larger number of smaller experts (as in DBRX) provides finer granularity of specialization. ^[55]
Sparse Upcycling: converting a dense model into an MoE from its checkpoint can significantly improve quality at a moderate cost. ^[58]

When to Use MoE

Large multi-domain assistants with a limited compute budget.
Training on vast corpora where specialization offers an advantage.
Scenarios with advanced distributed infrastructure (many GPUs/TPUs and fast networks).

When dense models are better: limited infrastructure (1–2 GPUs), strict requirements for predictable latency, and simplicity of deployment.

Retrieval-Augmented Generation (RAG)

RAG is an architectural system pattern built around an LLM, rather than an internal architecture of the model itself. It combines an LLM (the generative component) with an external knowledge base (the retrieval component), which helps compensate for the limitations of the model's "parametric memory".

Principle of Operation: Before the LLM generates a response, a retriever fetches relevant documents from an external source (a wiki, a corporate knowledge base, the web); these are typically added to the prompt, and the model relies on them to formulate the answer. ^[59]
Advantages:
- Reduced hallucinations and improved factual accuracy. ^[59]^[60]
- Up-to-date information without fully retraining the model. ^[59]
- Citable and traceable responses.
Application: The de facto standard for enterprise assistants and systems requiring verifiable facts and operation on private/specialized data. ^[59]
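
The retrieve-then-generate flow can be sketched without any model at all. In the toy example below, a bag-of-words cosine similarity stands in for a learned bi-encoder, and the final LLM call is omitted; the corpus, the prompt wording, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a learned encoder: bag-of-words term counts.
    return Counter(text.lower().split())


def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def retrieve(query, corpus, top_k=1):
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages):
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the sources below and cite them.\n{context}\nQuestion: {query}"


corpus = [
    "The warranty period for the X100 router is two years",
    "The cafeteria opens at eight in the morning",
    "Password resets are handled by the IT help desk",
]
question = "How long is the warranty for the X100 router"
passages = retrieve(question, corpus, top_k=1)
prompt = build_prompt(question, passages)
print(prompt)  # this string would be sent to the generative model
```

Numbering the passages in the prompt is one simple way to make responses citable, since the model can refer back to `[1]`, `[2]`, and so on.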

Attention Mechanisms and Context Handling

Basic self-attention has quadratic complexity with respect to sequence length ($O(n^{2})$), which has motivated a range of optimizations.

Sparse Attention: Restricts attention to local windows/patterns. Examples: Longformer^[61], BigBird^[62].
FlashAttention: Reorders computations to account for the GPU memory hierarchy; it provides significant gains in speed and memory usage and has become the de facto standard for training LLMs with long contexts^[63]^[64]^[65].
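MQA/GQA (decoding acceleration): Multi-Query Attention shares a single key/value head across all query heads, which reduces KV cache size and memory traffic^[66]. Grouped-Query Attention shares keys/values within groups of query heads, an intermediate point between multi-head attention and MQA that balances quality and speed^[67].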
Improved positional representations:
- ALiBi (Attention with Linear Biases): Linear biases added to attention scores improve generalization to longer lengths. ^[68]
- RoPE (Rotary Position Embeddings): Relative positional information via rotation of Q/K; widely used in modern models (e.g., LLaMA). ^[69]^[70]
- Context extension for RoPE models: Position Interpolation ^[71], YaRN ^[72], and NTK-aware modifications allow for efficient context window extension without architectural changes.
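
The two properties of RoPE referred to above (relative-position dependence and extension by rescaling positions) can be checked numerically. This is a minimal sketch assuming the usual pairing of adjacent dimensions and a frequency base of 10000; the `scale` argument models Position Interpolation as a simple compression of position indices, and the vectors are arbitrary toy values.

```python
import math


def rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    scale < 1 compresses positions, as in Position Interpolation
    (scale = trained_length / target_length).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = (position * scale) * base ** (-i / d)  # frequency of pair i // 2
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Attention scores depend only on the offset between positions (here 2).
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)

# Position Interpolation: position 8000 in an 8192 window is mapped to the
# angle of position 2000 in the original 2048 window.
interpolated = rope(q, 8000, scale=2048 / 8192)
original = rope(q, 2000)
```

YaRN and NTK-aware variants adjust the per-frequency scaling rather than compressing all positions uniformly; they are not covered by this sketch.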

Other approaches for long sequences:
- Transformer-XL: Recurrent memory between segments to model long-range dependencies. ^[73]
- Reformer: LSH-attention and reversible residual blocks to save memory. ^[74]
- Performer: Linear approximation of softmax-attention (FAVOR+). ^[75]
- Linformer: Low-rank approximation of the attention matrix. ^[76]

Model Optimization and Training Infrastructure

Specialized techniques and frameworks are used for training and deploying LLMs.

Quantization: Reducing the bit precision of weights decreases memory usage and can accelerate inference. QLoRA enables efficient fine-tuning by training low-rank adapters on top of a frozen 4-bit quantized base model (including 65B models), with quality close to full-precision fine-tuning^[77].
Knowledge Distillation: Teacher→Student training for compact models^[78]; an example is DistilBERT^[79].
Distributed Training:
- DeepSpeed and ZeRO distribute parameters/gradients/optimizer states to train trillion-parameter models^[80].
- Megatron-LM uses tensor and pipeline parallelism for very large transformers^[81].
Ecosystem and Tools: Hugging Face Transformers and Accelerate provide standard model implementations and integration with DeepSpeed/FSDP for training and inference^[82]^[83].
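
The basic idea behind weight quantization can be shown with the simplest scheme, symmetric per-tensor absmax quantization to 8-bit integers. This sketch is an illustration of the principle only: it is not the 4-bit data type used by QLoRA, and practical schemes typically quantize per channel or per block.

```python
def quantize_int8(weights):
    """Symmetric per-tensor absmax quantization to the range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        return [0] * len(weights), 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.01, 1.0]  # toy values (assumed)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error <= scale / 2 + 1e-12)
```

Each weight is stored in one byte instead of two or four, and the rounding error per weight is bounded by half the scale, which is why a single outlier weight (which inflates the scale) can hurt accuracy.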

Scaling Laws and Compute-Optimal Training

Empirical scaling laws demonstrate that cross-entropy loss decreases as a power law with increases in parameters, data, and computation. ^[84] The Chinchilla paper refined the compute-optimal regimes: for optimal efficiency, the model size and the number of training tokens should be scaled together (e.g., a 70B model trained on ~1.4T tokens outperforms larger, under-trained models). ^[85]
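
The 70B / 1.4T example corresponds to roughly 20 training tokens per parameter. Combined with the common approximation of about 6 FLOPs per parameter per training token, this gives a quick way to size a training run. Both constants are rules of thumb rather than exact laws, and the function names below are illustrative.

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb token budget implied by the 70B / 1.4T example."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


n_params = 70_000_000_000
n_tokens = compute_optimal_tokens(n_params)  # 1.4 trillion tokens
flops = training_flops(n_params, n_tokens)   # about 5.9e23 FLOPs
print(f"{n_tokens:.2e} tokens, {flops:.2e} FLOPs")
```

For a fixed compute budget, the same relation can be inverted: doubling the budget suggests growing parameters and tokens together rather than parameters alone.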

State Space Models (SSM)

State Space Models (SSM) are an alternative architecture to Transformers for processing long sequences. They borrow ideas from control theory and digital signal processing and address the main problem of self-attention: the quadratic growth of computation with increasing text length.

The Core Problem and Solution

The Transformer Problem. The main issue with traditional transformers is the quadratic complexity of attention: text that is 10 times longer requires about 100 times more attention computation.

The SSM Approach. Instead of "simultaneous attention to all words," the model processes the text sequentially and maintains a compact internal memory state that is updated at each step. As a result, time and memory consumption grow approximately linearly with text length. At the same time, training can be performed in parallel through a convolutional representation of the kernel (high throughput on long sequences). ^[86]

Principle of Operation

A discrete SSM is described by the state and output equations:

$x_{t} = A x_{t-1} + B u_{t}, \qquad y_{t} = C x_{t} + D u_{t}$

where $x_{t}$ is the memory state, $u_{t}$ is the input (token), and $y_{t}$ is the output. In deep SSMs, the matrices $A, B, C, D$ are parameterized to ensure stability and efficient computation on long sequences. The same layer can be viewed as:

recurrent (scanning step-by-step) — for memory-efficient inference without a KV cache;
convolutional — for parallel training with a pre-computed kernel. ^[86]
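
The equivalence of the two views can be verified directly. Unrolling the recurrence from a zero initial state gives $y_{t} = \sum_{k=0}^{t} C A^{k} B \, u_{t-k} + D u_{t}$, a causal convolution with kernel $K_{k} = C A^{k} B$. The sketch below uses a scalar state for brevity, which is a simplifying assumption; real SSM layers use structured matrices and compute the convolution with FFTs or parallel scans.

```python
def ssm_recurrent(u, A, B, C, D):
    """Step-by-step scan: x_t = A x_{t-1} + B u_t, y_t = C x_t + D u_t."""
    x, ys = 0.0, []
    for u_t in u:
        x = A * x + B * u_t
        ys.append(C * x + D * u_t)
    return ys


def ssm_convolutional(u, A, B, C, D):
    """Same outputs via a precomputed kernel K_k = C * A**k * B."""
    kernel = [C * (A ** k) * B for k in range(len(u))]
    ys = []
    for t in range(len(u)):
        conv = sum(kernel[k] * u[t - k] for k in range(t + 1))  # causal convolution
        ys.append(conv + D * u[t])
    return ys


u = [1.0, 0.0, 0.0, 2.0]           # toy input sequence (assumed)
A, B, C, D = 0.5, 1.0, 2.0, 0.1    # toy scalar parameters (assumed)
y_scan = ssm_recurrent(u, A, B, C, D)
y_conv = ssm_convolutional(u, A, B, C, D)
print(y_scan, y_conv)
```

The recurrent form needs only the current state at each step, while the convolutional form exposes every output position to parallel computation during training.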

Main Architectures and Hybrids

S4 (Structured State Spaces). The baseline SSM with a stable parameterization of the state matrix; demonstrates efficiency on very long sequences. ^[86]
Mamba. Selective SSMs: the memory update rules depend on the current input (the model decides what to "keep in memory" and what to "forget"). The implementation is optimized for the GPU memory hierarchy; according to the authors, it achieves about 5× higher inference throughput than Transformers, with linear scaling in sequence length. ^[87]
RetNet. A retention mechanism with three computation modes: parallel (for training), recurrent (for inference), and chunk-wise recurrent (for long sequences). The goal is to combine fast training (like Transformers) with efficient streaming inference (O(1) memory and compute per generated token). ^[88]
Attention+SSM Hybrids. An example is Jamba (alternating Transformer and Mamba layers, plus MoE): it reports support for contexts of up to ~256K tokens with significantly lower memory requirements compared to pure transformer models of a similar class. ^[89]

Advantages

Linear complexity and memory efficiency during inference. No global self-attention or KV cache; only a compact state is stored. ^[87]^[88]
Parallel training on long sequences. The convolutional mode increases training throughput. ^[86]
Hardware efficiency. Implementations are optimized for the modern memory hierarchy (HBM/SRAM). ^[87]
Long contexts and streaming. SSM+Attention hybrids are practical for hundreds of thousands of tokens with moderate resources. ^[89]

Limitations and Current Practice

Ecosystem maturity. Tools and "recipes" for scaling (instruction tuning, RLHF/DPO) are not yet as developed as the Transformer stack. ^[87]
Quality and stability. On some tasks, hybrids (Attention+SSM) show a more stable trade-off between quality, speed, and memory than "pure" SSMs. ^[89]

Comparison of Approaches (Generalized)

Characteristic	Transformers	SSM	Hybrids (Attention+SSM)
Complexity by length	Quadratic (self-attention)	Linear (scan/convolution)	Close to linear
Memory per token (inference)	KV cache grows with context	O(1) state	Moderate growth
Long contexts	Requires special optimizations	Natively supported	Practical up to ~256K
Ecosystem maturity	High	Developing	Developing

Practical Applications

Analysis of very long documents (books, reports, scientific reviews).
Stream processing and chat scenarios with long histories without increased memory costs.
Environments with limited resources (mobile/edge devices).
Time series and other sequential data.

Representative models: S4, Mamba, RetNet; Attention+SSM hybrids (Jamba). ^[86]^[87]^[88]^[89]

Evolution of Architectures

2017 — The paper "Attention Is All You Need" is published. It introduces the Transformer architecture: multi-head self-attention and positional encodings allow models to be trained without recurrence or convolutions; however, attention has quadratic complexity with respect to context length.^[1]

2018 — GPT-1 and BERT are introduced. GPT-1 uses a decoder-only stack with causal attention for generation and subsequent fine-tuning; BERT introduces a bidirectional encoder and MLM pre-training for text understanding tasks. ^[90]^[91]

2019 — Methods for handling long sequences are proposed, and the decoder-only approach is scaled up. Transformer-XL adds "memory" and relative positions to extend beyond a fixed window; GPT-2 demonstrates the growth of zero-shot capabilities with increased scale; BART shows the effectiveness of denoising pre-training for seq2seq. ^[92]^[93]^[94]

2020 — The "text-to-text" format is unified, and methods for long documents are shown. T5 formulates a unified encoder-decoder approach for various tasks; Longformer and BigBird use sparse/structured attention for long texts; GPT-3 confirms the effectiveness of scaling dense decoder-only models. ^[95]^[96]^[97]^[98]

2021 — Positional representations are improved, and parameter sparsity (MoE) is demonstrated. RoPE and ALiBi improve generalization to longer lengths; Switch Transformer and GLaM activate only a subset of experts per token, increasing capacity without a proportional increase in inference cost. ^[99]^[100]^[101]^[102]

2022 — The compute-optimal regime is refined, and inference on long prompts is accelerated. Chinchilla shows the benefit of more training tokens with a moderate model size; PaLM with Multi-Query Attention reduces KV cache size; FlashAttention speeds up attention on GPUs. ^[103]^[104]^[105]^[106]

2023 — Context windows are extended without layer modifications, and server-side delivery is improved. The LLaMA series solidifies best practices (RMSNorm, SwiGLU, RoPE); Position Interpolation and YaRN extend context; vLLM/PagedAttention more efficiently manages the KV cache. ^[107]^[108]^[109]^[110]^[111]^[112]

2023 — GPT-4 and Gemini demonstrate multi-modal processing and generation within a single family of models. ^[113]^[114]

2023 — State Space Models (SSM) and related recurrent alternatives to attention gain prominence. Building on earlier work such as S4 (2021), Mamba and RetNet bring back sequential processing with a compact state instead of a KV cache, laying the groundwork for hybrid architectures. ^[115]^[116]

2024 — Open-source MoE models and Attention+SSM hybrids are published; attention is accelerated on new GPUs. Mixtral 8×7B/8×22B and DBRX confirm the practicality of MoE; Jamba combines Transformer and Mamba for very long contexts; FlashAttention-3 increases throughput. ^[117]^[118]^[119]^[120]^[121]

Links

https://jalammar.github.io/illustrated-transformer/ The Illustrated Transformer — a visual explanation

Literature

Vaswani, A. et al. (2017). Attention Is All You Need. NIPS. https://arxiv.org/abs/1706.03762
Devlin, J. et al. (2019). BERT. NAACL. https://arxiv.org/abs/1810.04805
Brown, T. et al. (2020). Language Models are Few‑Shot Learners. NeurIPS. https://arxiv.org/abs/2005.14165
Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer (T5). JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf
Lewis, M. et al. (2019). BART: Denoising Sequence‑to‑Sequence Pre‑training. https://arxiv.org/abs/1910.13461
Touvron, H. et al. (2023). LLaMA. https://arxiv.org/abs/2302.13971
Chowdhery, A. et al. (2022). PaLM: Scaling Language Modeling with Pathways. https://arxiv.org/abs/2204.02311
Dao, T. et al. (2022–2024). FlashAttention (1/2/3). https://arxiv.org/abs/2205.14135 ; https://arxiv.org/abs/2307.08691 ; https://arxiv.org/abs/2407.08608
Shazeer, N. (2019). MQA. https://arxiv.org/abs/1911.02150
Ainslie, J. et al. (2023). GQA. https://arxiv.org/abs/2305.13245
Kwon, W. et al. (2023). PagedAttention / vLLM. https://arxiv.org/abs/2309.06180
Leviathan, Y. et al. (2023). Speculative Decoding. https://arxiv.org/abs/2211.17192
Fedus, W.; Zoph, B.; Shazeer, N. (2021/2022). Switch Transformers. https://arxiv.org/abs/2101.03961
Du, N. et al. (2022). GLaM. https://proceedings.mlr.press/v162/du22c/du22c.pdf
Jiang, A.Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088
Databricks (2024). Introducing DBRX. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
NVIDIA (2024). Applying Mixture of Experts in LLM Architectures. https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/
Zhou, Y. et al. (2022). Expert Choice Routing. https://arxiv.org/abs/2202.09368
Komatsuzaki, A. et al. (2022). Sparse Upcycling. https://arxiv.org/abs/2212.05055
Lewis, P. et al. (2020). RAG. https://arxiv.org/abs/2005.11401
Beltagy, I. et al. (2020). Longformer. https://arxiv.org/abs/2004.05150
Zaheer, M. et al. (2020). BigBird. https://arxiv.org/abs/2007.14062
Press, O. et al. (2022). ALiBi. https://arxiv.org/abs/2108.12409
Su, J. et al. (2021). RoFormer (RoPE). https://arxiv.org/abs/2104.09864
Chen, S. et al. (2023). Position Interpolation. https://arxiv.org/abs/2306.15595
Peng, B. et al. (2023). YaRN. https://arxiv.org/abs/2309.00071
Dettmers, T. et al. (2023). QLoRA. https://arxiv.org/abs/2305.14314
Rajbhandari, S. et al. (2020). ZeRO. https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/
Shoeybi, M. et al. (2019). Megatron‑LM. https://arxiv.org/abs/1909.08053
Kaplan, J. et al. (2020). Scaling Laws. https://arxiv.org/abs/2001.08361
Hoffmann, J. et al. (2022). Chinchilla / Compute‑Optimal. https://arxiv.org/abs/2203.15556
Gemini Team (2023). Gemini. https://arxiv.org/abs/2312.11805
Bai, Y. et al. (2022). Constitutional AI. https://arxiv.org/abs/2212.08073
OpenAI (2023). GPT‑4 Technical Report. https://arxiv.org/abs/2303.08774
OpenAI (2023). DevDay: GPT‑4 Turbo 128k. https://openai.com/index/new-models-and-developer-products-announced-at-devday/
Zhang, B.; Sennrich, R. (2019). RMSNorm. https://arxiv.org/abs/1910.07467
Shazeer, N. (2020). GLU Variants / SwiGLU. https://arxiv.org/abs/2002.05202
Gu, A.; Goel, K.; Ré, C. (2021). S4: Structured State Spaces. https://arxiv.org/abs/2111.00396
Gu, A.; Dao, T. (2023/2024). Mamba: Selective State Spaces. https://arxiv.org/abs/2312.00752
Sun, Y. et al. (2023). RetNet. https://arxiv.org/abs/2307.08621
Lieber, O. et al. (2024). Jamba: Hybrid Transformer‑Mamba. https://arxiv.org/abs/2403.19887
Dai, Z. et al. (2019). Transformer‑XL. https://arxiv.org/abs/1901.02860
Kitaev, N.; Kaiser, L.; Levskaya, A. (2020). Reformer. https://arxiv.org/abs/2001.04451
Choromanski, K. et al. (2021). Performer. https://arxiv.org/abs/2009.14794
Wang, S. et al. (2020). Linformer. https://arxiv.org/abs/2006.04768

Notes

1. Vaswani, A. et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
2. Devlin, J. et al. (2019). BERT. https://arxiv.org/abs/1810.04805
3. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
4. Raffel, C. et al. (2020). T5. https://jmlr.org/papers/volume21/20-074/20-074.pdf
5. Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
6. Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
7. He, P. et al. (2021). DeBERTa: Decoding‑enhanced BERT with Disentangled Attention. https://arxiv.org/abs/2006.03654
8. Clark, K. et al. (2020). ELECTRA: Pre‑training Text Encoders as Discriminators Rather Than Generators. https://arxiv.org/abs/2003.10555
9. Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. https://arxiv.org/abs/2007.14062
10. Beltagy, I. et al. (2020). Longformer: The Long‑Document Transformer. https://arxiv.org/abs/2004.05150
11. Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need (Multi‑Query Attention). https://arxiv.org/abs/1911.02150
12. Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. https://arxiv.org/abs/2305.13245
13. Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
14. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180
15. vLLM Docs (2024–2025). Continuous batching, Chunked prefill, Structured outputs. https://docs.vllm.ai/
16. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
17. Ouyang, L. et al. (2022). InstructGPT (RLHF). https://arxiv.org/abs/2203.02155
18. Rafailov, R. et al. (2023). Direct Preference Optimization. https://arxiv.org/abs/2305.18290
19. Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need (Multi‑Query Attention). https://arxiv.org/abs/1911.02150
20. Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. https://arxiv.org/abs/2305.13245
21. Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
22. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180
23. vLLM Docs (2024–2025). Continuous batching, Chunked prefill, Structured outputs. https://docs.vllm.ai/
24. OpenAI (2024). Structured Outputs. https://openai.com/index/introducing-structured-outputs-in-the-api/
25. vLLM Docs: Structured outputs. https://docs.vllm.ai/en/v0.9.2/features/structured_outputs.html
26. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180
27. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
28. Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971
29. Achiam, J. et al. (2023). GPT‑4 Technical Report. https://arxiv.org/abs/2303.08774
30. Meta AI (2024). Introducing Meta Llama 3. https://ai.meta.com/blog/meta-llama-3/
31. Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer (T5). JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf
32. Lewis, M. et al. (2019). BART: Denoising Sequence‑to‑Sequence Pre‑training. https://arxiv.org/abs/1910.13461
33. Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer. JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf
34. Lewis, M. et al. (2019). BART: Denoising Sequence‑to‑Sequence Pre‑training for NLG, Translation, and Comprehension. https://arxiv.org/abs/1910.13461
35. Chung, H. W. et al. (2022). Scaling Instruction‑Finetuned Language Models (FLAN‑T5). https://arxiv.org/abs/2210.11416
36. Shazeer, N. (2020). GLU Variants Improve Transformer. https://arxiv.org/abs/2002.05202
37. Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
38. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
39. Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971
40. Chowdhery, A. et al. (2022). PaLM: Scaling Language Modeling with Pathways. https://arxiv.org/abs/2204.02311
↑ Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361
↑ Hoffmann, J. et al. (2022). Training Compute‑Optimal Large Language Models. https://arxiv.org/abs/2203.15556
↑ Chen, S. et al. (2023). Extending Context Window via Positional Interpolation. https://arxiv.org/abs/2306.15595
↑ Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of LLMs. https://arxiv.org/abs/2309.00071
↑ Dao, T. et al. (2022–2024). FlashAttention (1/2/3). https://arxiv.org/abs/2205.14135 ; https://arxiv.org/abs/2307.08691 ; https://arxiv.org/abs/2407.08608
↑ Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need. https://arxiv.org/abs/1911.02150
↑ Ainslie, J. et al. (2023). GQA. https://arxiv.org/abs/2305.13245
↑ Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. https://arxiv.org/abs/2309.06180
↑ Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
↑ ^50.0 ^50.1 ^50.2 Fedus, W.; Zoph, B.; Shazeer, N. (2021/2022). Switch Transformers. https://arxiv.org/abs/2101.03961
↑ ^51.0 ^51.1 ^51.2 Du, N. et al. (2021). GLaM: Efficient Scaling of Language Models with Mixture‑of‑Experts. https://arxiv.org/pdf/2112.06905.pdf
↑ Mistral AI (2023). Mixtral of Experts. https://mistral.ai/news/mixtral-of-experts/
↑ Jiang, A.Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088
↑ Mistral AI (2024). Mixtral 8x22B. https://mistral.ai/news/mixtral-8x22b
↑ ^55.0 ^55.1 Databricks (2024). Introducing DBRX. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
↑ NVIDIA (2024). Applying Mixture of Experts in LLM Architectures. https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/
↑ Zhou, Y. et al. (2022). Mixture‑of‑Experts with Expert Choice Routing. https://arxiv.org/abs/2202.09368
↑ Komatsuzaki, A. et al. (2022). Sparse Upcycling: Training Mixture‑of‑Experts from Dense Checkpoints. https://arxiv.org/abs/2212.05055
↑ ^59.0 ^59.1 ^59.2 ^59.3 Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. https://arxiv.org/abs/2005.11401
↑ NVIDIA Blog (2025). What is Retrieval‑Augmented Generation (RAG). https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
↑ Beltagy, I. et al. (2020). Longformer. https://arxiv.org/abs/2004.05150
↑ Zaheer, M. et al. (2020). Big Bird. https://arxiv.org/abs/2007.14062
↑ Dao, T. et al. (2022). FlashAttention. https://arxiv.org/abs/2205.14135
↑ Dao, T. et al. (2023). FlashAttention‑2. https://arxiv.org/abs/2307.08691
↑ Shah, M. et al. (2024). FlashAttention‑3. https://arxiv.org/abs/2407.08608
↑ Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need. https://arxiv.org/abs/1911.02150
↑ Ainslie, J. et al. (2023). GQA. https://arxiv.org/abs/2305.13245
↑ Press, O. et al. (2022). ALiBi. https://arxiv.org/abs/2108.12409
↑ Su, J. et al. (2021). RoFormer: Rotary Position Embedding. https://arxiv.org/abs/2104.09864
↑ Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971
↑ Chen, S. et al. (2023). Extending Context Window via Positional Interpolation. https://arxiv.org/abs/2306.15595
↑ Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of LLMs. https://arxiv.org/abs/2309.00071
↑ Dai, Z. et al. (2019). Transformer‑XL: Attentive Language Models Beyond a Fixed‑Length Context. https://arxiv.org/abs/1901.02860
↑ Kitaev, N.; Kaiser, L.; Levskaya, A. (2020). Reformer: The Efficient Transformer. https://arxiv.org/abs/2001.04451
↑ Choromanski, K. et al. (2021). Rethinking Attention with Performers. https://arxiv.org/abs/2009.14794
↑ Wang, S. et al. (2020). Linformer: Self‑Attention with Linear Complexity. https://arxiv.org/abs/2006.04768
↑ Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. https://arxiv.org/abs/2305.14314
↑ Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531
↑ Sanh, V. et al. (2019). DistilBERT. https://arxiv.org/abs/1910.01108
↑ Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion‑Parameter Models. https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/
↑ Shoeybi, M. et al. (2019). Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053
↑ Hugging Face. Transformers Documentation. https://huggingface.co/docs/transformers
↑ Hugging Face. Accelerate Documentation. https://huggingface.co/docs/accelerate
↑ Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361
↑ Hoffmann, J. et al. (2022). Training Compute‑Optimal Large Language Models. https://arxiv.org/abs/2203.15556
↑ ^86.0 ^86.1 ^86.2 ^86.3 ^86.4 Gu, A.; Goel, K.; Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces (S4). https://arxiv.org/abs/2111.00396
↑ ^87.0 ^87.1 ^87.2 ^87.3 ^87.4 Gu, A.; Dao, T. (2023/2024). Mamba: Linear‑Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752
↑ ^88.0 ^88.1 ^88.2 Sun, Y. et al. (2023). Retentive Network: A Successor to Transformer for Large Language Models. https://arxiv.org/abs/2307.08621
↑ ^89.0 ^89.1 ^89.2 ^89.3 Lieber, O. et al. (2024). Jamba: A Hybrid Transformer‑Mamba Language Model. https://arxiv.org/abs/2403.19887
↑ Radford, A. et al. (2018). Improving Language Understanding by Generative Pre‑Training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
↑ Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
↑ Dai, Z. et al. (2019). Transformer‑XL. https://arxiv.org/abs/1901.02860
↑ Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
↑ Lewis, M. et al. (2019). BART. https://arxiv.org/abs/1910.13461
↑ Raffel, C. et al. (2020). T5. https://jmlr.org/papers/volume21/20-074/20-074.pdf
↑ Beltagy, I. et al. (2020). Longformer. https://arxiv.org/abs/2004.05150
↑ Zaheer, M. et al. (2020). BigBird. https://arxiv.org/abs/2007.14062
↑ Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
↑ Su, J. et al. (2021). RoPE. https://arxiv.org/abs/2104.09864
↑ Press, O. et al. (2021/2022). ALiBi. https://arxiv.org/abs/2108.12409
↑ Fedus, W.; Zoph, B.; Shazeer, N. (2021/2022). Switch Transformers. https://arxiv.org/abs/2101.03961
↑ Du, N. et al. (2021). GLaM. https://arxiv.org/pdf/2112.06905.pdf
↑ Hoffmann, J. et al. (2022). Chinchilla. https://arxiv.org/abs/2203.15556
↑ Chowdhery, A. et al. (2022). PaLM. https://arxiv.org/abs/2204.02311
↑ Shazeer, N. (2019). Fast Transformer Decoding. https://arxiv.org/abs/1911.02150
↑ Dao, T. et al. (2022). FlashAttention. https://arxiv.org/abs/2205.14135
↑ Touvron, H. et al. (2023). LLaMA. https://arxiv.org/abs/2302.13971
↑ Zhang, B.; Sennrich, R. (2019). RMSNorm. https://arxiv.org/abs/1910.07467
↑ Shazeer, N. (2020). GLU Variants. https://arxiv.org/abs/2002.05202
↑ Chen, S. et al. (2023). Position Interpolation. https://arxiv.org/abs/2306.15595
↑ Peng, B. et al. (2023). YaRN. https://arxiv.org/abs/2309.00071
↑ Kwon, W. et al. (2023). vLLM/PagedAttention. https://arxiv.org/abs/2309.06180
↑ OpenAI (2023). GPT‑4 Technical Report. https://arxiv.org/abs/2303.08774
↑ Gemini Team (2023). Gemini. https://arxiv.org/abs/2312.11805
↑ Gu, A.; Dao, T. (2023). Mamba. https://arxiv.org/abs/2312.00752
↑ Sun, Y. et al. (2023). RetNet. https://arxiv.org/abs/2307.08621
↑ Jiang, A.Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088
↑ Mistral AI (2024). Mixtral 8x22B. https://mistral.ai/news/mixtral-8x22b
↑ Databricks (2024). Introducing DBRX. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
↑ Lieber, O. et al. (2024). Jamba. https://arxiv.org/abs/2403.19887
↑ Shah, M. et al. (2024). FlashAttention‑3. https://arxiv.org/abs/2407.08608

[1] Vaswani, A. et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762

[2] Devlin, J. et al. (2019). BERT. https://arxiv.org/abs/1810.04805

[3] Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165

[4] Raffel, C. et al. (2020). T5. https://jmlr.org/papers/volume21/20-074/20-074.pdf

[5] Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

[6] Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692

[7] He, P. et al. (2021). DeBERTa: Decoding‑enhanced BERT with Disentangled Attention. https://arxiv.org/abs/2006.03654

[8] Clark, K. et al. (2020). ELECTRA: Pre‑training Text Encoders as Discriminators Rather Than Generators. https://arxiv.org/abs/2003.10555

[9] Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. https://arxiv.org/abs/2007.14062

[10] Beltagy, I. et al. (2020). Longformer: The Long‑Document Transformer. https://arxiv.org/abs/2004.05150

[11] Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need (Multi‑Query Attention). https://arxiv.org/abs/1911.02150

[12] Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. https://arxiv.org/abs/2305.13245

[13] Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192

[14] Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180

[15] vLLM Docs (2024–2025). Continuous batching, Chunked prefill, Structured outputs. https://docs.vllm.ai/

[16] Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165

[17] Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT, RLHF). https://arxiv.org/abs/2203.02155

[18] Rafailov, R. et al. (2023). Direct Preference Optimization. https://arxiv.org/abs/2305.18290

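[19] Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need. https://arxiv.org/abs/1911.02150

[20] Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. https://arxiv.org/abs/2305.13245

[21] Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192

[22] Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. https://arxiv.org/abs/2309.06180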

[23] vLLM Docs. https://docs.vllm.ai/

[24] OpenAI (2024). Structured Outputs. https://openai.com/index/introducing-structured-outputs-in-the-api/

[25] vLLM Docs — Structured outputs. https://docs.vllm.ai/en/v0.9.2/features/structured_outputs.html

[26] Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. https://arxiv.org/abs/2309.06180

[27] Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165

[28] Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971

[29] Achiam, J. et al. (2023). GPT‑4 Technical Report. https://arxiv.org/abs/2303.08774

[30] Meta AI (2024). Introducing Meta Llama 3. https://ai.meta.com/blog/meta-llama-3/

[31] Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer (T5). JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf

[32] Lewis, M. et al. (2019). BART: Denoising Sequence‑to‑Sequence Pre‑training. https://arxiv.org/abs/1910.13461

[33] Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer. JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf

[34] Lewis, M. et al. (2019). BART: Denoising Sequence‑to‑Sequence Pre‑training for NLG, Translation, and Comprehension. https://arxiv.org/abs/1910.13461

[35] Chung, H. W. et al. (2022). Scaling Instruction‑Finetuned Language Models (FLAN‑T5). https://arxiv.org/abs/2210.11416

[36] Shazeer, N. (2020). GLU Variants Improve Transformer. https://arxiv.org/abs/2002.05202

[37] Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

[38] Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165

[39] Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971

[40] Chowdhery, A. et al. (2022). PaLM: Scaling Language Modeling with Pathways. https://arxiv.org/abs/2204.02311

[41] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361

[42] Hoffmann, J. et al. (2022). Training Compute‑Optimal Large Language Models. https://arxiv.org/abs/2203.15556

[43] Chen, S. et al. (2023). Extending Context Window via Positional Interpolation. https://arxiv.org/abs/2306.15595

[44] Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of LLMs. https://arxiv.org/abs/2309.00071

[45] Dao, T. et al. (2022–2024). FlashAttention (1/2/3). https://arxiv.org/abs/2205.14135 ; https://arxiv.org/abs/2307.08691 ; https://arxiv.org/abs/2407.08608

[46] Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need. https://arxiv.org/abs/1911.02150

[47] Ainslie, J. et al. (2023). GQA. https://arxiv.org/abs/2305.13245

[48] Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. https://arxiv.org/abs/2309.06180

[49] Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192

[50] Fedus, W.; Zoph, B.; Shazeer, N. (2021/2022). Switch Transformers. https://arxiv.org/abs/2101.03961

[51] Du, N. et al. (2021). GLaM: Efficient Scaling of Language Models with Mixture‑of‑Experts. https://arxiv.org/pdf/2112.06905.pdf

[52] Mistral AI (2023). Mixtral of Experts. https://mistral.ai/news/mixtral-of-experts/

[53] Jiang, A.Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088

[54] Mistral AI (2024). Mixtral 8x22B. https://mistral.ai/news/mixtral-8x22b

[55] Databricks (2024). Introducing DBRX. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

[56] NVIDIA (2024). Applying Mixture of Experts in LLM Architectures. https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/

[57] Zhou, Y. et al. (2022). Mixture‑of‑Experts with Expert Choice Routing. https://arxiv.org/abs/2202.09368

[58] Komatsuzaki, A. et al. (2022). Sparse Upcycling: Training Mixture‑of‑Experts from Dense Checkpoints. https://arxiv.org/abs/2212.05055

[59] Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. https://arxiv.org/abs/2005.11401

[60] NVIDIA Blog (2025). What is Retrieval‑Augmented Generation (RAG). https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/

[61] Beltagy, I. et al. (2020). Longformer. https://arxiv.org/abs/2004.05150

[62] Zaheer, M. et al. (2020). Big Bird. https://arxiv.org/abs/2007.14062

[63] Dao, T. et al. (2022). FlashAttention. https://arxiv.org/abs/2205.14135

[64] Dao, T. et al. (2023). FlashAttention‑2. https://arxiv.org/abs/2307.08691

[65] Shah, J. et al. (2024). FlashAttention‑3. https://arxiv.org/abs/2407.08608

[66] Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need. https://arxiv.org/abs/1911.02150

[67] Ainslie, J. et al. (2023). GQA. https://arxiv.org/abs/2305.13245

[68] Press, O. et al. (2022). ALiBi. https://arxiv.org/abs/2108.12409

[69] Su, J. et al. (2021). RoFormer: Rotary Position Embedding. https://arxiv.org/abs/2104.09864

[70] Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971

[71] Chen, S. et al. (2023). Extending Context Window via Positional Interpolation. https://arxiv.org/abs/2306.15595

[72] Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of LLMs. https://arxiv.org/abs/2309.00071

[73] Dai, Z. et al. (2019). Transformer‑XL: Attentive Language Models Beyond a Fixed‑Length Context. https://arxiv.org/abs/1901.02860

[74] Kitaev, N.; Kaiser, L.; Levskaya, A. (2020). Reformer: The Efficient Transformer. https://arxiv.org/abs/2001.04451

[75] Choromanski, K. et al. (2021). Rethinking Attention with Performers. https://arxiv.org/abs/2009.14794

[76] Wang, S. et al. (2020). Linformer: Self‑Attention with Linear Complexity. https://arxiv.org/abs/2006.04768

[77] Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. https://arxiv.org/abs/2305.14314

[78] Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531

[79] Sanh, V. et al. (2019). DistilBERT. https://arxiv.org/abs/1910.01108

[80] Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion‑Parameter Models. https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/

[81] Shoeybi, M. et al. (2019). Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053

[82] Hugging Face. Transformers Documentation. https://huggingface.co/docs/transformers

[83] Hugging Face. Accelerate Documentation. https://huggingface.co/docs/accelerate

[84] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361

[85] Hoffmann, J. et al. (2022). Training Compute‑Optimal Large Language Models. https://arxiv.org/abs/2203.15556

[86] Gu, A.; Goel, K.; Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces (S4). https://arxiv.org/abs/2111.00396

[87] Gu, A.; Dao, T. (2023/2024). Mamba: Linear‑Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752

[88] Sun, Y. et al. (2023). Retentive Network: A Successor to Transformer for Large Language Models. https://arxiv.org/abs/2307.08621

[89] Lieber, O. et al. (2024). Jamba: A Hybrid Transformer‑Mamba Language Model. https://arxiv.org/abs/2403.19887

[90] Radford, A. et al. (2018). Improving Language Understanding by Generative Pre‑Training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[91] Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

[92] Dai, Z. et al. (2019). Transformer‑XL. https://arxiv.org/abs/1901.02860

[93] Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[94] Lewis, M. et al. (2019). BART. https://arxiv.org/abs/1910.13461

[95] Raffel, C. et al. (2020). T5. https://jmlr.org/papers/volume21/20-074/20-074.pdf

[96] Beltagy, I. et al. (2020). Longformer. https://arxiv.org/abs/2004.05150

[97] Zaheer, M. et al. (2020). BigBird. https://arxiv.org/abs/2007.14062

[98] Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165

[99] Su, J. et al. (2021). RoPE. https://arxiv.org/abs/2104.09864

[100] Press, O. et al. (2021/2022). ALiBi. https://arxiv.org/abs/2108.12409

[101] Fedus, W.; Zoph, B.; Shazeer, N. (2021/2022). Switch Transformers. https://arxiv.org/abs/2101.03961

[102] Du, N. et al. (2021). GLaM. https://arxiv.org/pdf/2112.06905.pdf

[103] Hoffmann, J. et al. (2022). Chinchilla. https://arxiv.org/abs/2203.15556

[104] Chowdhery, A. et al. (2022). PaLM. https://arxiv.org/abs/2204.02311

[105] Shazeer, N. (2019). Fast Transformer Decoding. https://arxiv.org/abs/1911.02150

[106] Dao, T. et al. (2022). FlashAttention. https://arxiv.org/abs/2205.14135

[107] Touvron, H. et al. (2023). LLaMA. https://arxiv.org/abs/2302.13971

[108] Zhang, B.; Sennrich, R. (2019). RMSNorm. https://arxiv.org/abs/1910.07467

[109] Shazeer, N. (2020). GLU Variants. https://arxiv.org/abs/2002.05202

[110] Chen, S. et al. (2023). Position Interpolation. https://arxiv.org/abs/2306.15595

[111] Peng, B. et al. (2023). YaRN. https://arxiv.org/abs/2309.00071

[112] Kwon, W. et al. (2023). vLLM/PagedAttention. https://arxiv.org/abs/2309.06180

[113] OpenAI (2023). GPT‑4 Technical Report. https://arxiv.org/abs/2303.08774

[114] Gemini Team (2023). Gemini. https://arxiv.org/abs/2312.11805

[115] Gu, A.; Dao, T. (2023). Mamba. https://arxiv.org/abs/2312.00752

[116] Sun, Y. et al. (2023). RetNet. https://arxiv.org/abs/2307.08621

[117] Jiang, A.Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088

[118] Mistral AI (2024). Mixtral 8x22B. https://mistral.ai/news/mixtral-8x22b

[119] Databricks (2024). Introducing DBRX. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

[120] Lieber, O. et al. (2024). Jamba. https://arxiv.org/abs/2403.19887

[121] Shah, J. et al. (2024). FlashAttention‑3. https://arxiv.org/abs/2407.08608
