Transformer architecture

From Systems Analysis Wiki

Transformer architecture is a neural network architecture introduced in 2017 by Google researchers in the paper "Attention Is All You Need"[1]. It revolutionized the field of natural language processing (NLP) and has become the foundation for most modern large language models (LLMs), such as BERT, GPT, and Gemini. The key innovation of the Transformer is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input data and process sequences in parallel, abandoning the recurrence inherent in RNNs and LSTMs.

Historical Context and Prerequisites

Before 2017, the dominant architectures for processing sequential data, such as text, were recurrent neural networks (RNNs) and their advanced variant, long short-term memory (LSTM) networks.

Problems with RNN/LSTM Addressed by the Transformer

  • Sequential Processing Limitations: RNNs and LSTMs process data token by token, which prevents intra-sequence parallelism and slows down training on large datasets.
  • Vanishing and Exploding Gradients Problem: In long sequences, gradients propagated backward through many steps can either vanish or explode, making it difficult to train long-term dependencies.
  • Long-Term Dependencies: Information from the beginning of a sequence can be lost by the time the end is reached.

The Transformer's approach is a complete departure from recurrence in favor of an attention mechanism. It provides a constant dependency path length between any two positions (O(1)), which facilitates the modeling of long-range dependencies, although the basic implementation of self-attention has quadratic computational complexity with respect to the sequence length (O(n²))[1][2]. The vanishing/exploding gradients problem doesn't "disappear" but is mitigated by residual connections, LayerNorm, and the training regimen; modern implementations often use the Pre-LayerNorm (Pre-LN) variant for greater training stability[3].
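The quadratic cost comes from the score matrix, which has one entry per (query, key) pair. A minimal sketch (function name and dimensions are illustrative):

```python
import numpy as np

def attention_scores(n, d=64, seed=0):
    """Raw attention score matrix for a length-n sequence of d-dim vectors."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    return X @ X.T  # (n, n): memory and compute grow as n^2

for n in (128, 256, 512):
    print(n, attention_scores(n).shape)  # doubling n quadruples the matrix size
```

This is why long-context serving and the "efficient Transformer" literature[2] focus on reducing or restructuring this n × n interaction.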

Architecture and Key Components

The original Transformer architecture consists of two main parts: an encoder and a decoder. Both components are stacks of identical layers (N=6 in the original paper)[1].

  • Encoder Layer Structure: (1) multi-head self-attention (MHA), (2) a position-wise feed-forward network (FFN); each sub-layer is surrounded by a residual connection and LayerNorm[1].
  • Decoder Layer Structure: (1) masked self-attention (a causal mask prevents access to future positions), (2) cross-attention to the encoder outputs, (3) an FFN—also with residuals and LayerNorm[1].

Attention Mechanism and Self-Attention

The attention mechanism computes a weighted sum of value vectors (Value), where the weights are determined by the compatibility of keys (Key) with queries (Query). The Transformer uses Scaled Dot-Product Attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Where d_k is the dimension of the keys/queries; dividing by √d_k keeps the dot products from growing too large and prevents the softmax function from saturating[1]. When Q, K, and V are generated from the same sequence, the mechanism is called self-attention.
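The formula above translates almost line-for-line into code. A minimal single-head sketch (no batching, names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -- one head, no batch dim.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    # Numerically stable row-wise softmax.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V                               # weighted sum of value vectors

# Tiny example: 3 query positions attending over 3 key/value positions.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```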

Multi-Head Attention

Instead of a single set of matrices (W^Q, W^K, W^V), h parallel "heads" are used, each of which projects Q, K, V into lower-dimensional subspaces and computes attention independently; the results are then concatenated and projected[1]:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

Variants for Inference Acceleration:

  • MQA (Multi-Query Attention): All heads share a single key/value pair, which significantly reduces the size and memory traffic of the KV cache during decoding[4].
  • GQA (Grouped-Query Attention): A compromise between MHA and MQA—several groups of heads share K/V; its quality is close to MHA with MQA-like speed[5].
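The difference between MHA, GQA, and MQA reduces to how many K/V heads the query heads share. A simplified sketch (single sequence, no batching; names and shapes are illustrative):

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, n_heads, n_kv_heads):
    """Illustrative grouped-query attention.

    n_kv_heads == n_heads -> standard MHA
    n_kv_heads == 1       -> MQA (all query heads share one K/V)
    otherwise             -> GQA (each group of query heads shares one K/V)
    """
    n, d = x.shape
    d_head = d // n_heads
    group = n_heads // n_kv_heads               # query heads per shared K/V head
    Q = (x @ Wq).reshape(n, n_heads, d_head)
    K = (x @ Wk).reshape(n, n_kv_heads, d_head)  # fewer K/V heads -> smaller KV cache
    V = (x @ Wv).reshape(n, n_kv_heads, d_head)
    outs = []
    for h in range(n_heads):
        kv = h // group                          # which shared K/V this head uses
        s = Q[:, h] @ K[:, kv].T / np.sqrt(d_head)
        w = np.exp(s - s.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        outs.append(w @ V[:, kv])
    return np.concatenate(outs, axis=-1)         # (n, d)

rng = np.random.default_rng(1)
d, n_heads, n_kv = 8, 4, 2
x = rng.normal(size=(5, d))
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, n_kv * (d // n_heads)))  # K/V projections shrink with n_kv
Wv = rng.normal(size=(d, n_kv * (d // n_heads)))
y = grouped_query_attention(x, Wq, Wk, Wv, n_heads, n_kv)
print(y.shape)  # (5, 8)
```

During autoregressive decoding only K and V are cached, so cutting the number of K/V heads from n_heads to n_kv_heads shrinks the KV cache by the same factor.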

Positional Encoding

Since self-attention is invariant to token order, positional encodings are added to the input embeddings.

  • Original sinusoidal encodings (PE) from[1]:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
  • Modern relative/rotary variants:
    • RoPE (Rotary Position Embeddings) encodes relative shifts by rotating the Q/K vectors; it is used in several modern LLMs[6].
    • ALiBi introduces a linear bias to attention scores, which improves extrapolation to lengths longer than those seen during training[7].
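The sinusoidal scheme above can be computed directly from the two formulas (function name is illustrative):

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from "Attention Is All You Need":
    PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):           # i already equals 2i in the formula
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
# Position 0 is sin(0) = 0 in even slots and cos(0) = 1 in odd slots.
print(pe[0][:4])  # [0.0, 1.0, 0.0, 1.0]
```

Each dimension pair oscillates at a different wavelength, so every position receives a unique pattern and relative offsets correspond to fixed linear transformations of the encoding.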

Position-wise FFN, Residuals, and Normalization

Each encoder and decoder layer, in addition to attention, contains a position-wise feed-forward network (FFN):

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2.

Residual connections and Layer Normalization are used around each sub-layer: LayerNorm(x+Sublayer(x)). The original paper used the Post-LN variant[1]; modern LLMs often use Pre-LN for better training stability and less reliance on a long warm-up period[3].
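The Post-LN/Pre-LN distinction is just where LayerNorm sits relative to the residual addition. A minimal sketch with the FFN as the sub-layer (unparameterized LayerNorm, names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise FFN: max(0, xW1 + b1)W2 + b2, applied to each position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def post_ln_sublayer(x, sublayer):
    # Original (Post-LN): normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_sublayer(x, sublayer):
    # Pre-LN: normalize the sub-layer's input; the residual path stays untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(2)
d, d_ff = 8, 32
x = rng.normal(size=(5, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
sub = lambda h: ffn(h, W1, b1, W2, b2)
print(post_ln_sublayer(x, sub).shape, pre_ln_sublayer(x, sub).shape)  # (5, 8) (5, 8)
```

In Pre-LN the identity path from input to output is never normalized, which is the intuition behind its better gradient flow and reduced warm-up sensitivity[3].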

Evolution and Modern Variants

Early methods based on recurrent neural networks (RNNs) and LSTMs processed text sequentially, one token at a time. Although this approach intuitively matched the structure of language, it hindered parallel computation and made it difficult to capture dependencies between distant elements. The Transformer abandoned recurrence entirely: the attention mechanism lets the model weigh the importance of every input position when producing each output position, and all positions are processed simultaneously. This parallelism made it practical to train much larger models on vast amounts of data, which led directly to the emergence of modern large language models (LLMs).

The Transformer architecture served as the basis for numerous models, which can be broadly divided into three classes.

1. Encoder-only Models

  • Example: BERT (and RoBERTa, ALBERT)[8].
  • Principle: Pre-training on a masked language modeling (MLM) task with a bidirectional context.
  • Application: Understanding tasks (classification, NER, etc.).

2. Decoder-only Models

  • Example: The GPT series (GPT-1/2/3)[9][10], LLaMA[11], Claude.
  • Principle: Causal language modeling (CLM)—predicting the next token; a causal mask is applied to the attention mechanism[1].
  • Application: Text generation, dialogue, and code.
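The causal mask that underlies decoder-only (CLM) training is simply a lower-triangular matrix over positions. A minimal sketch with uniform scores, so the masking effect is visible directly in the weights:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)   # future positions get -inf
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return w / w.sum(-1, keepdims=True)

scores = np.zeros((4, 4))                      # uniform scores, for illustration
w = masked_softmax(scores, causal_mask(4))
# Row i spreads weight uniformly over positions 0..i; the future gets zero:
# [[1.  , 0.  , 0.  , 0.  ],
#  [0.5 , 0.5 , 0.  , 0.  ],
#  [0.33, 0.33, 0.33, 0.  ],
#  [0.25, 0.25, 0.25, 0.25]]
print(np.round(w, 2))
```

During training this lets all positions be predicted in parallel while each prediction still only sees its prefix.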

3. Encoder-decoder Models

  • Example: The original Transformer, T5, BART[1][12].
  • Principle: The encoder builds a representation of the input, and the decoder generates the output using cross-attention to the encoder's features[1].
  • Application: Seq2seq tasks (translation, summarization, etc.).
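Cross-attention differs from self-attention only in where Q, K, and V come from: queries from the decoder, keys and values from the encoder. A minimal sketch (shapes and names are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; Q may come from a different sequence than K, V."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

rng = np.random.default_rng(4)
d = 8
enc_out = rng.normal(size=(10, d))  # encoder features for a 10-token source
dec_h   = rng.normal(size=(6, d))   # decoder states for a 6-token target prefix
# Cross-attention: each target position summarizes the whole source sequence.
ctx = attention(dec_h, enc_out, enc_out)
print(ctx.shape)  # (6, 8): one source-conditioned context vector per target position
```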

4. Multimodal and Alternative Architectures

  • Vision Transformer (ViT)—an adaptation for images (by splitting them into patches)[13]; Swin Transformer—a hierarchical model using shifted windows[14].
  • Alternatives for long sequences:
    • Mamba—selective state space models (SSMs) with linear complexity[15].
    • RWKV—an RNN-like architecture with parallelizable training and linear inference complexity[16].
    • Hybrids (e.g., Jamba): Alternate between Transformer and Mamba blocks; sometimes supplemented with MoE[17].

Training and Optimization Techniques

The effectiveness of the Transformer is closely tied to training techniques and infrastructure.

  • Pre-training Strategies: CLM and MLM; also contrastive and denoising objectives (ELECTRA, T5)[12].
  • Fine-tuning Techniques:
    • Full fine-tuning of all parameters.
    • Parameter-Efficient Fine-Tuning (PEFT): LoRA introduces low-rank adapters while keeping the base weights frozen[18].
  • Behavioral Alignment: RLHF—Reinforcement Learning from Human Feedback[19].
  • System-level Inference Optimizations: PagedAttention/vLLM increases serving throughput via paged management of the KV cache; especially useful for long sequences and large batches[20].
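The LoRA idea from the PEFT bullet can be sketched in a few lines: the pretrained weight stays frozen and only a low-rank update is trained. A simplified illustration (class name and hyperparameters are illustrative; real implementations attach adapters to attention projections inside a framework):

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear map: y = xW + (xB)A * (alpha / r).

    W is frozen; only the low-rank factors B (d_in x r) and A (r x d_out)
    are trained, cutting trainable parameters from d_in*d_out to r*(d_in + d_out).
    """
    def __init__(self, W, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        d_in, d_out = W.shape
        self.W = W                                    # frozen pretrained weight
        self.B = np.zeros((d_in, r))                  # zero-init: update starts at zero
        self.A = rng.normal(scale=0.01, size=(r, d_out))
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W + (x @ self.B) @ self.A * self.scale

rng = np.random.default_rng(3)
W = rng.normal(size=(16, 16))
layer = LoRALinear(W, r=4)
x = rng.normal(size=(2, 16))
# With B initialized to zero, the adapted layer initially equals the frozen one.
print(np.allclose(layer(x), x @ W))  # True
```

The zero initialization of one factor means fine-tuning starts exactly at the pretrained behavior, and after training the update BA can be merged into W, adding no inference cost[18].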

References

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762.
  2. Tay, Y., Dehghani, M., Bahri, D., Metzler, D. (2020). Efficient Transformers: A Survey. arXiv:2009.06732.
  3. Xiong, R., Yang, Y., He, D., et al. (2020). On Layer Normalization in the Transformer Architecture. ICML. arXiv:2002.04745.
  4. Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
  5. Ainslie, J., Lee-Thorp, J., de Jong, M., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP. arXiv:2305.13245.
  6. Su, J., Lu, Y., Pan, S., et al. (2021). RoFormer: Rotary Position Embedding. arXiv:2104.09864.
  7. Press, O., Smith, N. A., Lewis, M. (2021). Train Short, Test Long: Attention with Linear Biases (ALiBi). arXiv:2108.12409.
  8. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
  9. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
  10. Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models Are Few-Shot Learners. NeurIPS. arXiv:2005.14165.
  11. Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
  12. Raffel, C., Shazeer, N., Roberts, A., et al. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR. arXiv:1910.10683.
  13. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR. arXiv:2010.11929.
  14. Liu, Z., Lin, Y., Cao, Y., et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV. arXiv:2103.14030.
  15. Gu, A., Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
  16. Peng, B., et al. (2023). RWKV: Reinventing RNNs for the Transformer Era. arXiv:2305.13048.
  17. Lieber, O., Lenz, B., Bata, H., et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887.
  18. Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
  19. Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS. arXiv:2203.02155.
  20. Kwon, W., Li, Z., Zhuang, S., et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention. arXiv:2309.06180.