Decoder-only models (architecture)

From Systems Analysis Wiki

Decoder-only models are a dominant class of large language model (LLM) architectures based exclusively on the decoder part of the Transformer architecture. These models specialize in text generation tasks and form the foundation for most modern chatbots and AI assistants.

The flagship series that popularized this approach is the GPT family of models from OpenAI.

Concept and Architecture

The core idea behind decoder-only models is the autoregressive generation of sequences. This means the model predicts the next token based on all the preceding tokens that have been generated. The input prompt (user query) and the already generated text are treated as a single sequence for the model to continue.
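The loop described above can be sketched in a few lines. This is a toy illustration of autoregressive decoding, not a real model: the `toy_next_token` function is a hypothetical stand-in for the network's next-token prediction.

```python
# Minimal sketch of autoregressive generation. The "model" here is a
# trivial stand-in; a real LLM would return a token sampled from a
# learned probability distribution.

def toy_next_token(tokens):
    # Hypothetical predictor: just returns the last token plus one.
    return tokens[-1] + 1

def generate(prompt_tokens, steps, next_token_fn=toy_next_token):
    # The prompt and the generated text are one sequence that the
    # model keeps extending, one token at a time.
    tokens = list(prompt_tokens)
    for _ in range(steps):
        tokens.append(next_token_fn(tokens))  # prediction sees all prior tokens
    return tokens

print(generate([5], 3))  # [5, 6, 7, 8]
```

Each new token is appended to the sequence and becomes part of the context for the next prediction, which is exactly what makes the process autoregressive.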

Architecturally, the model is a stack of N identical decoder layers. Unlike the decoder layer of the original encoder-decoder Transformer, which has three sub-layers, each layer here contains only two main sub-layers:

  1. Masked Multi-Head Self-Attention: This is the key mechanism that enables the autoregressive property. During sequence processing, a special causal mask prevents each token from "looking at" subsequent tokens. Thus, the prediction for position i depends only on tokens at positions less than i.
  2. Feed-Forward Network: Applies a non-linear transformation to the representation of each token.

Decoder-only models lack a cross-attention mechanism, as there is no encoder to "attend to."
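The causal mask from sub-layer 1 can be made concrete with a small NumPy sketch: a lower-triangular boolean matrix in which row i marks the positions token i is allowed to attend to.

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular matrix: entry (i, j) is True iff j <= i,
    # i.e. token i may attend only to itself and earlier tokens.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# In attention, scores at False positions are typically set to -inf
# before the softmax, so future tokens receive zero attention weight.
scores = np.zeros((4, 4))
masked_scores = np.where(mask, scores, -np.inf)
```

With this mask in place, the softmax over each row assigns non-zero weight only to positions at or before the current one, which is what enforces the left-to-right (unidirectional) context in the comparison table below.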

Pre-training Tasks

Decoder-only models are trained on a single but very powerful self-supervised task:

Causal Language Modeling (CLM)

  • How it works: The model is trained to predict the next token in a sequence. At each training step, it receives a text fragment as input and must generate a probability distribution for the next token.
  • Objective: To maximize the probability of the correct next token across vast amounts of text data. This seemingly simple task forces the model to learn grammar, syntax, world knowledge, and complex language patterns.
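The objective above amounts to next-token cross-entropy: inputs and labels are the same sequence shifted by one position. A minimal sketch, using a uniform toy distribution over a hypothetical 10-token vocabulary purely for illustration:

```python
import numpy as np

# CLM training pairs: the label at each position is the next token.
tokens = [2, 7, 1, 9]
inputs, labels = tokens[:-1], tokens[1:]

# Hypothetical model output: one probability row per input position
# over a toy vocabulary of size 10 (uniform here, for illustration only).
probs = np.full((len(inputs), 10), 0.1)

# Negative log-likelihood of the correct next token at each position;
# training minimizes the mean, i.e. maximizes next-token probability.
nll = -np.log(probs[np.arange(len(labels)), labels])
loss = nll.mean()
```

Because every position in the training text yields a (context, next-token) pair, a single document provides many supervised examples for free, which is why this self-supervised setup scales so well.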

Applications

Due to their autoregressive nature, decoder-only models are ideally suited for any task that requires text generation:

  • Free-form text generation: Writing articles, poems, scripts, etc.
  • Conversational systems and chatbots: Answering user questions in a conversational style.
  • Summarization: Creating concise summaries of long texts.
  • Machine translation: Although encoder-decoder models are often used for this, decoder-only models can also handle translation if the task is framed in a prompt (e.g., "Translate from English to Russian: ...").
  • Code generation: Generating code from a text description.
  • In-context learning: Due to their scale, large decoder models demonstrate the ability to solve new tasks with just a few examples (few-shot) or even without any (zero-shot) provided directly in the prompt, without the need for fine-tuning.
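The few-shot setup from the last bullet is purely a matter of prompt layout. A hedged sketch of one common format (the task, examples, and labels here are invented for illustration; no specific model API is assumed):

```python
# Illustrative few-shot prompt: two worked examples, then the query.
# The model is expected to continue the pattern after "Sentiment:".
few_shot_prompt = """Classify the sentiment as positive or negative.

Review: The film was wonderful.
Sentiment: positive

Review: A complete waste of time.
Sentiment: negative

Review: I loved every minute of it.
Sentiment:"""
```

No weights are updated: the examples live entirely in the context window, and the model's next-token prediction completes the pattern.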

Key Models and Their Evolution

  • GPT series (2018–present): The pioneers and popularizers of the approach. GPT-1 showed the effectiveness of pre-training, GPT-2 demonstrated the power of scaling, and GPT-3 introduced few-shot capabilities. ChatGPT and GPT-4 have made this architecture the standard for AI assistants.
  • LLaMA (2023–present): A series of open-weight models from Meta that democratized access to powerful LLMs and spurred a wave of innovation in the community.
  • Claude (2023–present): A family of models from Anthropic focused on safety and controllability through Constitutional AI.
  • PaLM and Gemini (2022–present): Google's flagship models. Gemini is also a natively multimodal decoder-only model.

Comparison with Other Architectures

Comparison of Key Transformer-Based Architectures

Architecture    | Primary Task                        | Context Direction                                  | Typical Models
Decoder-only    | Text generation                     | Unidirectional (left-to-right)                     | GPT, LLaMA, Claude, Gemini
Encoder-only    | Text understanding                  | Bidirectional                                      | BERT, RoBERTa
Encoder-decoder | Sequence-to-sequence transformation | Bidirectional (encoder) + unidirectional (decoder) | T5, BART, the original Transformer

See Also

  • GPT