Large language model architectures

From Systems Analysis Wiki

Large Language Model (LLM) architectures are the fundamental principles and structures that define how large language models are built, trained, and operated. Modern LLMs, capable of understanding and generating human language, are almost entirely based on the Transformer architecture[1], but incorporate numerous enhancements and different approaches aimed at improving efficiency, scalability, and capabilities.

Families of LLM Architectures (Transformers)

Modern large language models are based on the Transformer architecture[1], but they utilize it differently depending on the objective: to understand text, generate a continuation, or transform one text into another. In practice, three main families are distinguished, all while preserving the core principles of the Transformer[2][3][4].

1. Encoder-only

The model uses only a stack of encoders and processes the entire input text bidirectionally. Pre-training is typically based on masked-language modeling (MLM): some tokens are masked, and the model learns to reconstruct them from their context. Thanks to this bidirectional context, such models excel at understanding and scoring tasks, such as classification, named entity recognition, document re-ranking, and extractive QA. They are not designed for autoregressive generation from scratch.

Additionally, alternative pre-training objectives are used for the encoder-only family in practice: replaced token detection (RTD) in ELECTRA (where a discriminator model identifies replaced tokens) and contrastive learning for bi-encoders in semantic search/retrieval (InfoNCE/softmax loss on "query-document" pairs, as in Dense Passage Retrieval). When used in RAG, encoder-only models act either as a bi-encoder (separate encoding of query and document for fast ANN search) or as a cross-encoder (joint encoding of the pair for precise re-ranking).
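The contrastive objective mentioned above can be illustrated with a minimal sketch of an InfoNCE-style loss with in-batch negatives. The function name `info_nce_loss`, the toy 2-D vectors, and the temperature default are assumptions made for this example; real bi-encoders produce the vectors with a trained encoder.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query_vecs, doc_vecs, temperature=1.0):
    """Mean cross-entropy where document i is the positive for query i
    and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the positive document.
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)

# Toy embeddings, made up for illustration.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[2.0, 0.0], [0.0, 2.0]]   # document i matches query i
swapped_docs = [[0.0, 2.0], [2.0, 0.0]]   # positives and negatives swapped

loss_aligned = info_nce_loss(queries, aligned_docs)
loss_swapped = info_nce_loss(queries, swapped_docs)
```

The loss is small when each query scores its own document highest and large when the pairing is swapped, which is the signal that pulls matching query-document pairs together during contrastive fine-tuning.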

Advantages:

  • High-quality text understanding due to bidirectional context: classification, NER, fact extraction, re-ranking, extractive QA.
  • Parallel processing and high throughput: a single forward pass without auto-regression; convenient for batching mass scoring tasks.
  • Natural integration with search and RAG: as a bi-encoder for fast semantic search; as a cross-encoder for precise re-ranking.
  • Effective adaptation: relatively compact variants (≈100–300M parameters; BERT-base ≈110M) achieve high quality after task-specific fine-tuning.
  • Stable latency, independent of the generated response length (no step-by-step decoding); well-suited for offline scoring of large collections.
  • Ability to extend the context window using relative/rotary position embeddings and/or local-sparse attention (e.g., Longformer/BigBird), which is useful for long documents.

Disadvantages:

  • No native generative capabilities: requires a decoder or an external generative module for dialogues and detailed answers.
  • Limited in interactive scenarios: no step-by-step generation with state preservation.
  • Mismatch between pre-training objective and free-form generation tasks: MLM is less aligned with generation compared to causal language modeling.
  • Historically limited context window (often 512 tokens in base configurations with absolute positions); extension requires special position/attention schemes and/or fine-tuning.
  • For retrieval tasks, separate contrastive fine-tuning of a bi-encoder and/or cross-encoder is required; without it, search/re-ranking quality is typically lower than that of specially trained models.

Representative models: BERT and its derivatives, as well as RoBERTa and DeBERTa (advanced encoder-only variants); from alternative pre-training objectives, ELECTRA (RTD). [5][6][7][8][9][10]

2. Decoder-only

This architecture uses only a stack of decoders with causal (left-to-right) attention: the model predicts the next token based on a given prefix. This training mode—causal language modeling (CLM)—makes these models a natural choice for generation tasks: dialogues, detailed answers, creative text, and code. The trade-off is an increase in latency and KV cache size with long prompts. In practice, various engineering techniques are widely used for decoder-only models: reducing the KV cache with MQA and GQA, accelerating inference with speculative decoding, and server-side optimizations (PagedAttention/vLLM, continuous batching, chunked prefill).[11][12][13][14][15]
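To make the KV cache trade-off concrete, the following sketch estimates cache size for one sequence. It uses the LLaMA-65B shape quoted later in this article (80 layers, 64 heads, head size 8192/64 = 128); the GQA variant with 8 KV heads and the MQA variant with 1 KV head are hypothetical configurations for comparison, and 16-bit storage is an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Keys and values (factor 2) stored per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 2 ** 30
seq_len = 4096

mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=seq_len)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=seq_len)   # hypothetical
mqa = kv_cache_bytes(n_layers=80, n_kv_heads=1, head_dim=128, seq_len=seq_len)   # hypothetical

print(f"MHA: {mha / GIB:.2f} GiB, GQA-8: {gqa / GIB:.2f} GiB, MQA: {mqa / GIB:.3f} GiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, which is why MQA and GQA matter most for long prompts and large batches.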

Advantages:

  • Natural text generation (CLM): strong zero-shot and few-shot capabilities; scales well.[16]
  • Versatility: a single model can solve many tasks through instructions and examples in the prompt; naturally combines with RAG and tool use.
  • Mature ecosystem: established practices for instruction fine-tuning and alignment (RLHF, DPO); open-source and commercial implementations are available.[17][18]
  • Rich stack of inference optimizations: MQA/GQA reduce KV cache size and increase throughput; speculative decoding accelerates inference without changing the output distribution; PagedAttention/vLLM with continuous batching and chunked prefill improve end-to-end GPU utilization.[19][20][21][22][23]
  • Support for structured generation for strict output formats (JSON/SQL/DSL), which simplifies integration with information systems and APIs.[24][25]

Disadvantages:

  • Increased generation latency: sequential output; the cost of a new token grows with the length of the already processed context (KV cache).
  • Less efficient for "long input–short output" profiles (summarization, translation) compared to encoder-decoder models, where the input is encoded only once.
  • Limited by unidirectional context: in understanding tasks, it can sometimes be inferior to models with bidirectional representations (encoder-only / encoder-decoder).
  • Memory for the KV cache can be a bottleneck with long prompts and large batches. [26]
  • Quantization of activations/KV (INT8/FP8) speeds up inference but can degrade quality on long contexts/code; requires careful validation (especially with strict SLAs).

Representative models: GPT-3, GPT-4 (architecture and dataset details are not publicly disclosed), LLaMA and Llama 3 (8B/70B, 2024).[27][28][29][30]

3. Encoder-decoder

This architecture combines both components. The encoder operates bidirectionally, while the decoder operates causally. The encoder analyzes the input once to form its representation; the decoder then generates the output by attending to this representation via cross-attention. This separate approach is particularly useful for tasks that require transforming a long input text into a short output, such as machine translation, summarization, and document-based question answering. Although this method requires greater overall computational resources (due to two stacks and cross-attention), its advantage is controlled generation based on a complete analysis of the source text. The encoding is performed only once and is reused throughout the entire inference process.

Advantages:

  • Conditional generation: the decoder uses cross-attention to the input representation. [31]
  • Efficient in "long input → short output" scenarios: the input is encoded only once.
  • Convenient for the "text-to-text" format and controlled output (task-specific prefixes, special instructions). [32]
  • Stability and efficiency with long source texts: during the decoding phase, only the self-attention over the output grows, while cross-attention reuses fixed keys/values from the encoder (the input is not "re-read" at each step).

Disadvantages:

  • Two stacks increase memory and computational requirements for training and deployment.
  • For very long outputs, total latency is comparable to decoder-only models, because autoregressive decoding remains the bottleneck.
  • Fewer universal chat models compared to the decoder-only family; more often used as a high-quality seq2seq engine for specific tasks.
  • With very long inputs, the memory for cross-attention keys/values in each decoder layer increases (across the entire source), requiring careful serving planning.

Representative models: T5 (including T5 v1.1 and the instruction fine-tuning practice in FLAN-T5) and BART. [33][34][35]

Dense Transformers

The classic and most common LLM architecture, where nearly the entire set of model parameters is engaged in processing each token. Unlike sparse approaches (e.g., Mixture-of-Experts), there is no selective activation of subnetworks—every block operates on every token. [1]

Principle of Operation and Architecture

Basic Structure. The model is a stack of N identical Transformer blocks. Each block includes:

  1. Multi-Head Self-Attention. For each token, three vectors are computed: Q (query), K (key), and V (value); attention is defined as $\mathrm{softmax}\left(\frac{QK^\top + M}{\sqrt{d_k}}\right)V$, where M is a mask (causal and/or padding mask) that excludes invalid positions by setting their scores to $-\infty$. Multiple attention "heads" run in parallel to consider different aspects of the context (H heads, usually $d_{head} = d_{model}/H$); their number grows with the model's scale. [1]
  2. Feed-Forward Network (FFN). Two linear layers with a non-linearity between them (typically GELU/SiLU); several modern models use the gated SwiGLU variant, which adds a third projection matrix. The intermediate dimensionality is typically $4\,d_{model}$; with SwiGLU, it is often set to $\frac{8}{3}\,d_{model}$ so that the three matrices hold roughly as many parameters as the two matrices of the standard FFN. The FFN contains a significant portion of the parameters. [1][36]
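The attention formula in item 1 can be sketched for a single head in plain Python. This is an illustrative, unoptimized version (real implementations use batched tensor operations); the function name `causal_attention` and the toy inputs are assumptions for the example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V are lists of n vectors (one per token)."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            # Causal mask: future positions (j > i) get a score of -inf.
            scores.append(s if j <= i else float("-inf"))
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = causal_attention(Q, K, V)

# Changing a later token must not affect earlier outputs.
V_changed = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
out_changed = causal_attention(Q, K, V_changed)
```

The first token can only attend to itself, so its output equals its own value vector, and edits to later tokens leave earlier outputs untouched.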

Additional Components. Residual connections and layer normalization are used; modern LLMs more often apply Pre-LN (normalization before sub-blocks), which improves training stability at greater depths. In addition to classic LayerNorm, RMSNorm is increasingly used (it reduces computational overhead and performs well in large models); some families also apply normalization in the attention space (e.g., normalizing Q/K before softmax). Positional representations can be absolute or relative; for long contexts, RoPE has become the de facto standard.
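The difference between the two normalization layers can be sketched as follows. This is a minimal illustration on a single vector; the `eps` value and function names are assumptions, and production code operates on batched tensors.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """Classic LayerNorm: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: no mean subtraction and no bias, only rescaling by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

x = [3.0, 4.0]
y_rms = rms_norm(x, gain=[1.0, 1.0])
y_ln = layer_norm(x, gain=[1.0, 1.0], bias=[0.0, 0.0])
```

RMSNorm skips the mean computation and the bias term, which is where its lower overhead comes from; the output keeps the direction of the input and has a root mean square of about 1.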

Examples of Models and Scale

  • BERT-Large: 24 layers, 1024 hidden size, 16 attention heads, ≈340M parameters. [37]
  • GPT-3 (175B): 96 layers, 12288 hidden size, 96 attention heads, ≈175B parameters. [38]
  • LLaMA-65B: 80 layers, 8192 hidden size, 64 attention heads, ≈65B parameters. [39]
  • PaLM-540B: 118 layers, hidden size of around 18432, ≈540B parameters. [40]
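As a rough sanity check of these figures, the parameters in the Transformer blocks can be estimated as 12·d_model² per layer: about 4·d_model² for the Q, K, V, and output projections plus about 8·d_model² for the FFN (either two matrices with a 4·d_model intermediate size or three SwiGLU matrices with 8/3·d_model). This back-of-the-envelope formula is an approximation introduced here, not a figure from the cited papers; it ignores embeddings, biases, and normalization weights. PaLM is left out because it uses Multi-Query Attention, so the attention term differs.

```python
def approx_block_params(n_layers, d_model):
    """~4*d^2 (attention projections) + ~8*d^2 (FFN) per layer."""
    return 12 * n_layers * d_model ** 2

# (layers, hidden size, parameter count stated in the list above)
models = {
    "BERT-Large": (24, 1024, 340e6),
    "GPT-3": (96, 12288, 175e9),
    "LLaMA-65B": (80, 8192, 65e9),
}

estimates = {name: approx_block_params(layers, d)
             for name, (layers, d, _) in models.items()}

for name, (_, _, stated) in models.items():
    print(f"{name}: estimate {estimates[name] / 1e9:.2f}B vs stated {stated / 1e9:.2f}B")
```

The estimates land within roughly 1% of the stated sizes for GPT-3 and LLaMA-65B; the gap is larger for BERT-Large, where the embedding tables are a bigger share of the total.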

Advantages

  • Uniform blocks, well-studied training regimes, and predictable scaling behavior.
  • Quality improves as a power law with the growth of parameters and data; the compute-optimal regime involves increasing both model size and the volume of training tokens. [41][42]
  • The same architecture can cover a wide range of tasks after fine-tuning, without changes at the layer level.

Disadvantages

  • Full self-attention has quadratic complexity with respect to sequence length ($O(n^2)$), which limits the context window. [1]
  • Full parameter activation at each generation step: in a non-MoE decoder, the inference cost per token grows approximately proportionally to the number of parameters.
  • Bottlenecked by memory bandwidth (memory-bound): loading weights from HBM often limits inference speed.

Scaling and Context Limitations

  • Memory for parameters grows linearly with model size; training memory increases due to gradients and optimizer states.
  • Base configurations were historically limited to 2,000–4,000 tokens. Modern positional schemes (RoPE) and extension techniques (Position Interpolation, YaRN, etc.) can increase the window by an order of magnitude or more, but at the cost of additional computational/memory load. [43][44]

Modern Optimizations

  • FlashAttention. An exact attention implementation that is aware of the GPU memory hierarchy; it reduces memory costs and accelerates training/inference for long sequences. [45]
  • KV Cache Reduction and Management. Multi-Query Attention and Grouped-Query Attention reduce cache size and memory traffic; at the server level, PagedAttention (vLLM) increases throughput via page-based cache management. [46][47][48]
  • Speculative Decoding. A small draft model proposes several tokens ahead, and the main model verifies them in a single parallel forward pass, accepting or resampling each one; this achieves acceleration without changing the output distribution. [49]
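The claim that speculative decoding leaves the output distribution unchanged can be checked with a small simulation of the accept/reject rule for a single token. This is a sketch under simplifying assumptions: the two distributions are fixed toy vectors rather than model outputs, and only one drafted token is verified per step.

```python
import random

def speculative_step(p_target, q_draft, rng):
    """Draw one token: the draft proposes x ~ q, the target accepts it with
    probability min(1, p(x)/q(x)); on rejection, resample from the
    normalized residual max(0, p - q)."""
    tokens = list(range(len(q_draft)))
    x = rng.choices(tokens, weights=q_draft)[0]
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0]

rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]   # toy "main model" distribution
q_draft = [0.2, 0.5, 0.3]    # toy "draft model" distribution

n_samples = 20000
counts = [0, 0, 0]
for _ in range(n_samples):
    counts[speculative_step(p_target, q_draft, rng)] += 1
empirical = [c / n_samples for c in counts]
print(empirical)
```

Even though the draft distribution differs substantially from the target, the empirical frequencies converge to the target distribution; the draft only affects how often proposals are accepted, and therefore the speed-up.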

Sparse Models and Mixture-of-Experts (MoE)

MoE is a method for increasing a model's capacity without a proportional increase in computation per token. Instead of a single large FFN block, each layer uses a set of parallel "experts" (multiple independent FFNs), and a trainable router (gating network) selects the top-k most relevant experts for each token (typically k=1–2; in some models, k=4). Only the selected experts are activated; their outputs are weighted and summed. This allows the total number of parameters to reach hundreds of billions or even trillions, while only a small fraction is engaged at each step. [50][51]
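The routing step described above can be sketched for a single token. The router matrix, the four toy "experts" (simple scaling functions standing in for FFNs), and the choice to renormalize gate weights over the selected experts are assumptions for illustration; real models differ in how they normalize gates.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        gate = probs[i] / norm          # renormalized over the selected experts
        y = experts[i](token)           # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top

# Toy experts: each one scales its input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = moe_layer([2.0, 1.0], router_weights, experts, k=2)
```

Only two of the four experts run for this token, so the per-token compute depends on k rather than on the total number of experts.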

Examples of Models and Scale

  • Switch Transformer (Google): up to ~1.6T parameters; top-1 routing (one expert per token). Showed that MoE allows for a dramatic increase in capacity at comparable per-token costs. [50]
  • GLaM (Google): 1.2T parameters, 64 experts per layer, top-2; ≈96.6B parameters (≈8%) are activated for each token. [51]
  • Mixtral 8×7B (Mistral AI): ~46.7B total parameters, ≈12.9B active per token, top-2. [52][53]
  • Mixtral 8×22B: ~141B total parameters, ≈39B active per token, top-2. [54]
  • DBRX (Databricks): 132B total parameters, ≈36B active per token; 16 experts and top-4 routing (fine-grained MoE). [55]

Advantages

  • Computational cost is determined by the number of active experts (k), not the total number of parameters: trillion-scale models can be trained and used with costs comparable to much smaller dense models. [51]
  • Specialization: experts automatically "adapt" to languages/domains/patterns, improving quality in multi-domain tasks.
  • Flexible deployment: frequently used experts can be kept in memory, while rare ones are loaded on demand (with appropriate infrastructure).

Limitations

  • Load balancing: without regularization, the router can "stick" to a subset of experts (router collapse). Auxiliary losses (load-balancing) and improved routing schemes are needed. [50]
  • Complexity of distributed computing: requires expert parallelism and all-to-all communication; communication overhead and memory management become bottlenecks. [56]
  • Training stability: router settings and capacity limits are crucial; otherwise, quality/convergence degradation is possible.
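The auxiliary load-balancing loss can be sketched in the form used for top-1 routing in Switch Transformer: N · Σ f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The scaling coefficient that multiplies this term in training is omitted, and the toy router outputs are made up for the example.

```python
def load_balancing_loss(router_probs):
    """router_probs: one probability vector over experts per token."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [0.0] * n_experts   # fraction of tokens sent to each expert (top-1)
    P = [0.0] * n_experts   # mean router probability per expert
    for probs in router_probs:
        chosen = max(range(n_experts), key=lambda i: probs[i])
        f[chosen] += 1.0 / n_tokens
        for i, p in enumerate(probs):
            P[i] += p / n_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))

# Balanced: each of four tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

The loss reaches its minimum of 1 under a uniform assignment and grows as routing concentrates on fewer experts, which penalizes router collapse.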

Modern Improvements

  • Expert-Choice routing: experts "choose" tokens, which improves load balancing and convergence at comparable costs. [57]
  • Fine-grained MoE: a larger number of smaller experts (as in DBRX) provides finer granularity of specialization. [55]
  • Sparse Upcycling: converting a dense model into an MoE from its checkpoint can significantly improve quality at a moderate cost. [58]

When to Use MoE

  • Large multi-domain assistants with a limited compute budget.
  • Training on vast corpora where specialization offers an advantage.
  • Scenarios with advanced distributed infrastructure (many GPUs/TPUs and fast networks).

When dense models are better: limited infrastructure (1–2 GPUs), strict requirements for predictable latency, and simplicity of deployment.

Retrieval-Augmented Generation (RAG)

RAG is an architectural system pattern built around an LLM, rather than an internal architecture of the model itself. It combines an LLM (the generative component) with an external knowledge base (the retrieval component), which helps compensate for the limitations of the model's "parametric memory".

  • Principle of Operation: Before the LLM generates a response, a retriever fetches relevant documents from an external source (a wiki, a corporate knowledge base, the web) and passes them to the model as context; the LLM then relies on them to formulate the answer. [59]
  • Advantages:
    • Reduced hallucinations and improved factual accuracy. [59][60]
    • Up-to-date information without fully retraining the model. [59]
    • Citable and traceable responses.
  • Application: The de facto standard for enterprise assistants and systems requiring verifiable facts and operation on private/specialized data. [59]
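The retrieve-then-generate flow can be sketched with a toy retriever. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a trained bi-encoder, the three documents are made up, and the final call to a generative model is left out, so the sketch ends at the assembled prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (placeholder for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, top_k=1):
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, passages):
    """Number the passages so the answer can cite them."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using only the sources below and cite them by number.\n"
            f"{context}\n\nQuestion: {query}")

documents = [
    "Mixtral 8x7B activates two experts per token",
    "RoPE encodes relative positions by rotating queries and keys",
    "FlashAttention reduces memory traffic on GPUs",
]
query = "how many experts does Mixtral activate per token"
passages = retrieve(query, documents, top_k=1)
prompt = build_prompt(query, passages)
```

Numbering the retrieved passages in the prompt is one simple way to obtain the citable, traceable responses listed among the advantages.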

Attention Mechanisms and Context Handling

Basic self-attention has quadratic complexity with respect to sequence length ($O(n^2)$), leading to the development of optimizations.

  • Sparse Attention: Restricts attention to local windows/patterns. Examples: Longformer[61], BigBird[62].
  • FlashAttention: Reorders computations to account for the GPU memory hierarchy; it provides significant gains in speed and memory usage and has become the de facto standard for training LLMs with long contexts[63][64][65].
  • MQA/GQA (decoding acceleration): Multi-Query Attention (shared keys/values for all heads) reduces KV cache traffic[66]. Grouped-Query Attention balances quality and speed[67].
  • Improved positional representations:
    • ALiBi (Attention with Linear Biases): Linear biases added to attention scores improve generalization to longer lengths. [68]
    • RoPE (Rotary Position Embeddings): Relative positional information via rotation of Q/K; widely used in modern models (e.g., LLaMA). [69][70]
    • Context extension for RoPE models: Position Interpolation [71], YaRN [72], and NTK-aware modifications allow for efficient context window extension without architectural changes.
  • Other approaches for long sequences:
    • Transformer-XL: Recurrent memory between segments to model long-range dependencies. [73]
    • Reformer: LSH-attention and reversible residual blocks to save memory. [74]
    • Performer: Linear approximation of softmax-attention (FAVOR+). [75]
    • Linformer: Low-rank approximation of the attention matrix. [76]
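The key property of RoPE, that attention scores depend only on the relative offset between positions, and the idea behind Position Interpolation can both be sketched in a few lines. This version rotates adjacent pairs of dimensions; the `scale` argument is a name chosen for this example and stands for the ratio of the original to the extended context length.

```python
import math

def rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.
    scale < 1 compresses positions into the range seen during training."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = (position * scale) * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 1.0]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

# Position Interpolation idea: position 8192 in a 2x-extended window
# is mapped to position 4096 of the original window.
interpolated = rope(q, 8192, scale=0.5)
original = rope(q, 4096)
```

Both scores agree up to floating-point error, and the interpolated rotation coincides with a position the model has already seen in training, which is why context extension needs no change to the layers themselves.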

Model Optimization and Training Infrastructure

Specialized techniques and frameworks are used for training and deploying LLMs.

  • Quantization: Reducing the bit precision of weights decreases memory usage and accelerates inference. QLoRA enables efficient fine-tuning on top of a frozen 4-bit quantized base model by training low-rank adapters (including for 65B models), with quality close to full-precision fine-tuning[77].
  • Knowledge Distillation: Teacher→Student training for compact models[78]; an example is DistilBERT[79].
  • Distributed Training:
    • DeepSpeed and ZeRO distribute parameters/gradients/optimizer states to train trillion-parameter models[80].
    • Megatron-LM uses tensor and pipeline parallelism for very large transformers[81].
  • Ecosystem and Tools: Hugging Face Transformers and Accelerate provide standard model implementations and integration with DeepSpeed/FSDP for training and inference[82][83].
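The basic idea of weight quantization can be shown with symmetric absmax quantization to 8-bit integers. This is a generic textbook scheme used here for illustration; it is not the 4-bit format used by QLoRA, and the weights are made-up values.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of two or four, and the reconstruction error per weight is bounded by half the scale; outliers enlarge the scale and therefore the error for all other weights, which is one reason practical schemes quantize in small blocks.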

Scaling Laws and Compute-Optimal Training

Empirical scaling laws demonstrate that cross-entropy loss decreases as a power law with increases in parameters, data, and computation. [84] The Chinchilla paper refined the compute-optimal regimes: for optimal efficiency, the model size and the number of training tokens should be scaled together (e.g., a 70B model trained on ~1.4T tokens outperforms larger, under-trained models). [85]
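The 70B / 1.4T example can be turned into simple arithmetic. The sketch below uses the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D); that rule of thumb and the 4× larger comparison model are assumptions for this example, not figures from the text above.

```python
def training_flops(n_params, n_tokens):
    """Common approximation: C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

n_params = 70 * 10 ** 9        # 70B parameters
n_tokens = 1400 * 10 ** 9      # ~1.4T training tokens

tokens_per_param = n_tokens / n_params
flops = training_flops(n_params, n_tokens)

# Hypothetical alternative: spend the same compute on a 4x larger model.
bigger_params = 4 * n_params
bigger_tokens = flops // (6 * bigger_params)
bigger_tokens_per_param = bigger_tokens / bigger_params

print(f"{tokens_per_param:.0f} tokens per parameter, ~{flops:.2e} FLOPs")
print(f"4x larger model at equal compute: {bigger_tokens_per_param:.2f} tokens per parameter")
```

At a fixed compute budget, quadrupling the model size cuts the number of training tokens by four, leaving the larger model with far fewer tokens per parameter; this is the under-training regime that the compute-optimal analysis warns against.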

State Space Models (SSM)

State Space Models (SSM) are an alternative architecture to Transformers for processing long sequences. They borrow ideas from control theory and digital signal processing and address the main problem of self-attention: the quadratic growth of computation with increasing text length.

The Core Problem and Solution

The Transformer Problem. The main issue with traditional transformers is the quadratic complexity of attention: text that is 10 times longer requires about 100 times more computation.

The SSM Approach. Instead of "simultaneous attention to all words," the model processes the text sequentially and maintains a compact internal memory state that is updated at each step. As a result, time and memory consumption grow approximately linearly with text length. At the same time, training can be performed in parallel through a convolutional representation of the kernel (high throughput on long sequences). [86]

Principle of Operation

A discrete SSM is described by the state and output equations:

$$x_t = A x_{t-1} + B u_t, \qquad y_t = C x_t + D u_t$$

where $x_t$ is the memory state, $u_t$ is the input (token), and $y_t$ is the output. In deep SSMs, the matrices $A, B, C, D$ are parameterized to ensure stability and efficient computation on long sequences. The same layer can be viewed as:

  • recurrent (scanning step-by-step) — for memory-efficient inference without a KV cache;
  • convolutional — for parallel training with a pre-computed kernel. [86]
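The equivalence of the two views can be verified numerically. The sketch below assumes a diagonal state matrix A (stored as a list), a scalar input and output per step, and a zero initial state; the kernel is K_j = C A^j B, and the convolution is written as a naive double loop, whereas practical implementations use FFT-based or scan-based kernels.

```python
def ssm_recurrent(a, b, c, d, inputs):
    """x_t = A x_{t-1} + B u_t, y_t = C x_t + D u_t with diagonal A."""
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        state = [ai * xi + bi * u for ai, xi, bi in zip(a, state, b)]
        outputs.append(sum(ci * xi for ci, xi in zip(c, state)) + d * u)
    return outputs

def ssm_kernel(a, b, c, length):
    """Pre-computed convolution kernel K_j = C A^j B for j = 0..length-1."""
    return [sum(ci * (ai ** j) * bi for ai, bi, ci in zip(a, b, c))
            for j in range(length)]

def ssm_convolutional(a, b, c, d, inputs):
    """y_t = sum_j K_j u_{t-j} + D u_t (causal convolution)."""
    kernel = ssm_kernel(a, b, c, len(inputs))
    return [sum(kernel[j] * inputs[t - j] for j in range(t + 1)) + d * inputs[t]
            for t in range(len(inputs))]

# Toy parameters, made up for illustration (|a_i| < 1 keeps the state stable).
a, b, c, d = [0.9, 0.5], [1.0, 0.5], [0.3, -0.2], 0.1
inputs = [1.0, 0.0, 2.0, -1.0, 0.5]

y_rec = ssm_recurrent(a, b, c, d, inputs)
y_conv = ssm_convolutional(a, b, c, d, inputs)
```

The recurrent form keeps only the fixed-size `state` between steps, while the convolutional form computes every output independently of the others and can therefore be parallelized across the sequence during training.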

Main Architectures and Hybrids

  • S4 (Structured State Spaces). The baseline SSM with a stable parameterization of the state matrix; demonstrates efficiency on very long sequences. [86]
  • Mamba. Selective SSMs: the memory update rules depend on the current input (the model decides what to "keep in memory" and what to "forget"). The implementation is optimized for the GPU memory hierarchy; according to the authors, it achieves several times higher inference throughput than comparable Transformers (about 5× in their measurements) with linear complexity in sequence length. [87]
  • RetNet. A retention mechanism with three modes: parallel training, recurrent inference, and chunk-wise recurrent inference. The goal is to combine fast training (like Transformers) with efficient streaming inference (O(1) memory per token). [88]
  • Attention+SSM Hybrids. An example is Jamba (alternating Transformer and Mamba layers, plus MoE): it reports support for contexts of up to ~256K tokens with significantly lower memory requirements compared to pure transformer models of a similar class. [89]

Advantages

  • Linear complexity and memory efficiency during inference. No global self-attention or KV cache; only a compact state is stored. [87][88]
  • Parallel training on long sequences. The convolutional mode increases training throughput. [86]
  • Hardware efficiency. Implementations are optimized for the modern memory hierarchy (HBM/SRAM). [87]
  • Long contexts and streaming. SSM+Attention hybrids are practical for hundreds of thousands of tokens with moderate resources. [89]

Limitations and Current Practice

  • Ecosystem maturity. Tools and "recipes" for scaling (instruction tuning, RLHF/DPO) are not yet as developed as the Transformer stack. [87]
  • Quality and stability. On some tasks, hybrids (Attention+SSM) show a more stable trade-off between quality, speed, and memory than "pure" SSMs. [89]

Comparison of Approaches (Generalized)

| Characteristic | Transformers | SSM | Hybrids (Attention+SSM) |
|---|---|---|---|
| Complexity by length | Quadratic (self-attention) | Linear (scan/convolution) | Close to linear |
| Memory per token (inference) | KV cache grows with context | O(1) state | Moderate growth |
| Long contexts | Requires special optimizations | Natively supported | Practical up to ~256K |
| Ecosystem maturity | High | Developing | Developing |

Practical Applications

  • Analysis of very long documents (books, reports, scientific reviews).
  • Stream processing and chat scenarios with long histories without increased memory costs.
  • Environments with limited resources (mobile/edge devices).
  • Time series and other sequential data.

Representative models: S4, Mamba, RetNet; Attention+SSM hybrids (Jamba). [86][87][88][89]

Evolution of Architectures

  • 2017 — The paper "Attention Is All You Need" is published. It introduces the Transformer architecture: multi-head self-attention and positional encodings allow models to be trained without recurrence or convolutions; however, attention has quadratic complexity with respect to context length.[1]
  • 2018 — GPT-1 and BERT are introduced. GPT-1 uses a decoder-only stack with causal attention for generation and subsequent fine-tuning; BERT introduces a bidirectional encoder and MLM pre-training for text understanding tasks. [90][91]
  • 2019 — Methods for handling long sequences are proposed, and the decoder-only approach is scaled up. Transformer-XL adds "memory" and relative positions to extend beyond a fixed window; GPT-2 demonstrates the growth of zero-shot capabilities with increased scale; BART shows the effectiveness of denoising pre-training for seq2seq. [92][93][94]
  • 2020 — The "text-to-text" format is unified, and methods for long documents are shown. T5 formulates a unified encoder-decoder approach for various tasks; Longformer and BigBird use sparse/structured attention for long texts; GPT-3 confirms the effectiveness of scaling dense decoder-only models. [95][96][97][98]
  • 2021 — Positional representations are improved, and parameter sparsity (MoE) is demonstrated. RoPE and ALiBi improve generalization to longer lengths; Switch Transformer and GLaM activate only a subset of experts per token, increasing capacity without a proportional increase in inference cost. [99][100][101][102]
  • 2022 — The compute-optimal regime is refined, and inference on long prompts is accelerated. Chinchilla shows the benefit of more training tokens with a moderate model size; PaLM with Multi-Query Attention reduces KV cache size; FlashAttention speeds up attention on GPUs. [103][104][105][106]
  • 2023 — Context windows are extended without layer modifications, and server-side delivery is improved. The LLaMA series solidifies best practices (RMSNorm, SwiGLU, RoPE); Position Interpolation and YaRN extend context; vLLM/PagedAttention more efficiently manages the KV cache. [107][108][109][110][111][112]
  • 2023 — GPT-4 and Gemini demonstrate multi-modal processing and generation within a single family of models. [113][114]
  • 2023 — Attention-free sequence models with a compact recurrent state gain traction. Mamba (a selective SSM building on the earlier S4 line of work) and RetNet bring back sequential processing with a compact state instead of a KV cache, laying the groundwork for hybrid architectures. [115][116]
  • 2024 — Open-source MoE models and Attention+SSM hybrids are published; attention is accelerated on new GPUs. Mixtral 8×7B/8×22B and DBRX confirm the practicality of MoE; Jamba combines Transformer and Mamba for very long contexts; FlashAttention-3 increases throughput. [117][118][119][120][121]


Notes

  1. Vaswani, A. et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
  2. Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
  3. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
  4. Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer (T5). JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf
  5. Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
  6. Liu, Y. et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
  7. He, P. et al. (2021). DeBERTa: Decoding‑enhanced BERT with Disentangled Attention. https://arxiv.org/abs/2006.03654
  8. Clark, K. et al. (2020). ELECTRA: Pre‑training Text Encoders as Discriminators Rather Than Generators. https://arxiv.org/abs/2003.10555
  9. Zaheer, M. et al. (2020). Big Bird: Transformers for Longer Sequences. https://arxiv.org/abs/2007.14062
  10. Beltagy, I. et al. (2020). Longformer: The Long‑Document Transformer. https://arxiv.org/abs/2004.05150
  11. Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need (Multi‑Query Attention). https://arxiv.org/abs/1911.02150
  12. Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. https://arxiv.org/abs/2305.13245
  13. Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
  14. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180
  15. vLLM Docs (2024–2025). Continuous batching, Chunked prefill, Structured outputs. https://docs.vllm.ai/
  16. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
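  17. Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT, RLHF). https://arxiv.org/abs/2203.02155
  18. Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. https://arxiv.org/abs/2305.18290
  19. Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need (Multi‑Query Attention). https://arxiv.org/abs/1911.02150
  20. Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. https://arxiv.org/abs/2305.13245
  21. Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
  22. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180
  23. vLLM Docs (2024–2025). Continuous batching, Chunked prefill, Structured outputs. https://docs.vllm.ai/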
  24. OpenAI (2024). Structured Outputs. https://openai.com/index/introducing-structured-outputs-in-the-api/
  25. vLLM Docs — Structured outputs. https://docs.vllm.ai/en/v0.9.2/features/structured_outputs.html
  26. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180
  27. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
  28. Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971
  29. Achiam, J. et al. (2023). GPT‑4 Technical Report. https://arxiv.org/abs/2303.08774
  30. Meta AI (2024). Introducing Meta Llama 3. https://ai.meta.com/blog/meta-llama-3/
  31. Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer (T5). JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf
  32. Lewis, M. et al. (2019). BART: Denoising Sequence‑to‑Sequence Pre‑training. https://arxiv.org/abs/1910.13461
  33. Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text‑to‑Text Transformer. JMLR. https://jmlr.org/papers/volume21/20-074/20-074.pdf
  34. Lewis, M. et al. (2019). BART: Denoising Sequence‑to‑Sequence Pre‑training for NLG, Translation, and Comprehension. https://arxiv.org/abs/1910.13461
  35. Chung, H. W. et al. (2022). Scaling Instruction‑Finetuned Language Models (FLAN‑T5). https://arxiv.org/abs/2210.11416
  36. Shazeer, N. (2020). GLU Variants Improve Transformer. https://arxiv.org/abs/2002.05202
  37. Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
  38. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
  39. Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971
  40. Chowdhery, A. et al. (2022). PaLM: Scaling Language Modeling with Pathways. https://arxiv.org/abs/2204.02311
  41. Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361
  42. Hoffmann, J. et al. (2022). Training Compute‑Optimal Large Language Models. https://arxiv.org/abs/2203.15556
  43. Chen, S. et al. (2023). Extending Context Window via Positional Interpolation. https://arxiv.org/abs/2306.15595
  44. Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of LLMs. https://arxiv.org/abs/2309.00071
  45. Dao, T. et al. (2022–2024). FlashAttention (1/2/3). https://arxiv.org/abs/2205.14135 ; https://arxiv.org/abs/2307.08691 ; https://arxiv.org/abs/2407.08608
  46. Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need. https://arxiv.org/abs/1911.02150
  47. Ainslie, J. et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. https://arxiv.org/abs/2305.13245
  48. Kwon, W. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. https://arxiv.org/abs/2309.06180
  49. Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
  50. Fedus, W.; Zoph, B.; Shazeer, N. (2021/2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. https://arxiv.org/abs/2101.03961
  51. Du, N. et al. (2021). GLaM: Efficient Scaling of Language Models with Mixture‑of‑Experts. https://arxiv.org/pdf/2112.06905.pdf
  52. Mistral AI (2023). Mixtral of Experts. https://mistral.ai/news/mixtral-of-experts/
  53. Jiang, A.Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088
  54. Mistral AI (2024). Mixtral 8x22B. https://mistral.ai/news/mixtral-8x22b
  55. Databricks (2024). Introducing DBRX. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
  56. NVIDIA (2024). Applying Mixture of Experts in LLM Architectures. https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/
  57. Zhou, Y. et al. (2022). Mixture‑of‑Experts with Expert Choice Routing. https://arxiv.org/abs/2202.09368
  58. Komatsuzaki, A. et al. (2022). Sparse Upcycling: Training Mixture‑of‑Experts from Dense Checkpoints. https://arxiv.org/abs/2212.05055
  59. Lewis, P. et al. (2020). Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks. https://arxiv.org/abs/2005.11401
  60. NVIDIA Blog (2025). What is Retrieval‑Augmented Generation (RAG). https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
  61. Beltagy, I. et al. (2020). Longformer. https://arxiv.org/abs/2004.05150
  62. Zaheer, M. et al. (2020). Big Bird. https://arxiv.org/abs/2007.14062
  63. Dao, T. et al. (2022). FlashAttention. https://arxiv.org/abs/2205.14135
  64. Dao, T. et al. (2023). FlashAttention‑2. https://arxiv.org/abs/2307.08691
  65. Shah, J. et al. (2024). FlashAttention‑3. https://arxiv.org/abs/2407.08608
  66. Shazeer, N. (2019). Fast Transformer Decoding: One Write‑Head is All You Need. https://arxiv.org/abs/1911.02150
  67. Ainslie, J. et al. (2023). GQA. https://arxiv.org/abs/2305.13245
  68. Press, O. et al. (2022). ALiBi. https://arxiv.org/abs/2108.12409
  69. Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
  70. Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971
  71. Chen, S. et al. (2023). Extending Context Window of Large Language Models via Positional Interpolation. https://arxiv.org/abs/2306.15595
  72. Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/abs/2309.00071
  73. Dai, Z. et al. (2019). Transformer‑XL: Attentive Language Models Beyond a Fixed‑Length Context. https://arxiv.org/abs/1901.02860
  74. Kitaev, N.; Kaiser, L.; Levskaya, A. (2020). Reformer: The Efficient Transformer. https://arxiv.org/abs/2001.04451
  75. Choromanski, K. et al. (2021). Rethinking Attention with Performers. https://arxiv.org/abs/2009.14794
  76. Wang, S. et al. (2020). Linformer: Self‑Attention with Linear Complexity. https://arxiv.org/abs/2006.04768
  77. Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. https://arxiv.org/abs/2305.14314
  78. Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531
  79. Sanh, V. et al. (2019). DistilBERT. https://arxiv.org/abs/1910.01108
  80. Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion‑Parameter Models. https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/
  81. Shoeybi, M. et al. (2019). Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053
  82. Hugging Face. Transformers Documentation. https://huggingface.co/docs/transformers
  83. Hugging Face. Accelerate Documentation. https://huggingface.co/docs/accelerate
  84. Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361
  85. Hoffmann, J. et al. (2022). Training Compute‑Optimal Large Language Models. https://arxiv.org/abs/2203.15556
  86. Gu, A.; Goel, K.; Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces (S4). https://arxiv.org/abs/2111.00396
  87. Gu, A.; Dao, T. (2023/2024). Mamba: Linear‑Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752
  88. Sun, Y. et al. (2023). Retentive Network: A Successor to Transformer for Large Language Models. https://arxiv.org/abs/2307.08621
  89. Lieber, O. et al. (2024). Jamba: A Hybrid Transformer‑Mamba Language Model. https://arxiv.org/abs/2403.19887
  90. Radford, A. et al. (2018). Improving Language Understanding by Generative Pre‑Training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
  91. Devlin, J. et al. (2019). BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
  92. Dai, Z. et al. (2019). Transformer‑XL. https://arxiv.org/abs/1901.02860
  93. Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  94. Lewis, M. et al. (2019). BART. https://arxiv.org/abs/1910.13461
  95. Raffel, C. et al. (2020). T5. https://jmlr.org/papers/volume21/20-074/20-074.pdf
  96. Beltagy, I. et al. (2020). Longformer. https://arxiv.org/abs/2004.05150
  97. Zaheer, M. et al. (2020). BigBird. https://arxiv.org/abs/2007.14062
  98. Brown, T. et al. (2020). Language Models are Few‑Shot Learners. https://arxiv.org/abs/2005.14165
  99. Su, J. et al. (2021). RoPE. https://arxiv.org/abs/2104.09864
  100. Press, O. et al. (2021/2022). ALiBi. https://arxiv.org/abs/2108.12409
  101. Fedus, W.; Zoph, B.; Shazeer, N. (2021/2022). Switch Transformers. https://arxiv.org/abs/2101.03961
  102. Du, N. et al. (2021). GLaM. https://arxiv.org/pdf/2112.06905.pdf
  103. Hoffmann, J. et al. (2022). Chinchilla. https://arxiv.org/abs/2203.15556
  104. Chowdhery, A. et al. (2022). PaLM. https://arxiv.org/abs/2204.02311
  105. Shazeer, N. (2019). Fast Transformer Decoding. https://arxiv.org/abs/1911.02150
  106. Dao, T. et al. (2022). FlashAttention. https://arxiv.org/abs/2205.14135
  107. Touvron, H. et al. (2023). LLaMA. https://arxiv.org/abs/2302.13971
  108. Zhang, B.; Sennrich, R. (2019). RMSNorm. https://arxiv.org/abs/1910.07467
  109. Shazeer, N. (2020). GLU Variants. https://arxiv.org/abs/2002.05202
  110. Chen, S. et al. (2023). Position Interpolation. https://arxiv.org/abs/2306.15595
  111. Peng, B. et al. (2023). YaRN. https://arxiv.org/abs/2309.00071
  112. Kwon, W. et al. (2023). vLLM/PagedAttention. https://arxiv.org/abs/2309.06180
  113. OpenAI (2023). GPT‑4 Technical Report. https://arxiv.org/abs/2303.08774
  114. Gemini Team (2023). Gemini. https://arxiv.org/abs/2312.11805
  115. Gu, A.; Dao, T. (2023). Mamba. https://arxiv.org/abs/2312.00752
  116. Sun, Y. et al. (2023). RetNet. https://arxiv.org/abs/2307.08621
  117. Jiang, A.Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088
  118. Mistral AI (2024). Mixtral 8x22B. https://mistral.ai/news/mixtral-8x22b
  119. Databricks (2024). Introducing DBRX. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
  120. Lieber, O. et al. (2024). Jamba. https://arxiv.org/abs/2403.19887
  121. Shah, J. et al. (2024). FlashAttention‑3. https://arxiv.org/abs/2407.08608