Decoder (Transformer)
Decoder — in machine learning and deep learning, this is a neural network component whose primary task is to transform an encoded representation (e.g., a context vector or a sequence of hidden states from an encoder) into an output data sequence (e.g., text or an image). The decoder generates output data step-by-step, typically in an autoregressive manner.
In a broader sense, in information theory, a decoder is any device or algorithm that converts encoded data back into its original or understandable format.
Concept and Purpose
While an encoder is responsible for understanding and compressing input data, a decoder is responsible for generating and expanding output data. It takes a compact, information-rich representation and sequentially transforms it into the desired format, whether it's a sentence in another language, a text description of an image, or a sequence of musical notes.
A key characteristic of most decoders is their autoregressive nature: to generate the next element in a sequence (e.g., a word or a pixel), they use both the encoded representation and all previously generated elements.
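The autoregressive loop described above can be sketched in a few lines. Here `toy_next_token` is a hypothetical stand-in for a trained decoder (it simply echoes the encoded sequence); the point is the control flow: each step conditions on the encoded input plus everything generated so far, and stops at an end-of-sequence token.

```python
def toy_next_token(encoded, generated):
    """Hypothetical stand-in for a trained decoder's prediction step:
    here it just echoes the encoded sequence, one element per step."""
    return encoded[len(generated)]

# "Encoded representation" from a hypothetical encoder (token ids here).
encoded = [7, 3, 9]
EOS = -1  # end-of-sequence marker

generated = []
while True:
    # Prediction uses BOTH the encoded input and all previous outputs.
    nxt = toy_next_token(encoded + [EOS], generated)
    if nxt == EOS:
        break
    generated.append(nxt)  # the next step will condition on this token too

print(generated)  # [7, 3, 9]
```

A real decoder replaces `toy_next_token` with a learned network, but the step-by-step conditioning on previous outputs is the same.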
Decoder in Different Architectures
Decoder in an Autoencoder
In an autoencoder architecture, the decoder performs the inverse task of the encoder:
- The Encoder compresses input data into a latent representation.
- The Decoder takes this latent representation and attempts to restore (reconstruct) the original data from it.
After training such a network, the decoder can be used on its own to generate new data by feeding it vectors sampled from the latent space.
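A minimal sketch of this split, using a linear encoder/decoder with made-up (untrained) weights to show the data flow: the decoder maps a latent code back to data space, and can be fed freshly sampled latent vectors on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "trained" weights of a linear autoencoder:
# encoder maps R^4 -> R^2, decoder maps R^2 -> R^4.
W_enc = rng.normal(size=(2, 4))
W_dec = rng.normal(size=(4, 2))

x = rng.normal(size=4)
z = W_enc @ x        # encoder: compress input into a latent code
x_hat = W_dec @ z    # decoder: reconstruct the original from the code

# Used independently: decode a point sampled from the latent space.
z_new = rng.normal(size=2)
generated = W_dec @ z_new
print(generated.shape)  # (4,)
```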
Decoder in Recurrent Neural Networks (RNN/LSTM)
In classic sequence-to-sequence (seq2seq) models based on RNNs or LSTMs, the decoder is a recurrent network that generates the output sequence token by token.
- Operating Principle: The decoder is initialized with the final hidden state (context vector) of the encoder. At each step, it takes the previously generated token and its own previous hidden state as input, then generates the next token and updates its state. This process continues until a special end-of-sequence token (`<EOS>`) is generated.
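The operating principle above, as a sketch with a tiny vanilla-RNN cell and random (untrained) weights; all sizes and parameter names are illustrative. The state is initialized from the encoder's context vector, and each step feeds back the previously generated token.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

H, V = 3, 5                      # hidden size, vocabulary size (toy values)
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(H, H))   # previous state -> state
W_xh = rng.normal(size=(H, V))   # input token (one-hot) -> state
W_hy = rng.normal(size=(V, H))   # state -> vocabulary logits
EOS, SOS = 0, 1                  # special end/start-of-sequence token ids

h = rng.normal(size=H)           # init: the encoder's final hidden state
token, output = SOS, []
for _ in range(10):              # cap length in case <EOS> never appears
    x = np.eye(V)[token]                        # one-hot of previous token
    h = np.tanh(W_xh @ x + W_hh @ h)            # update hidden state
    token = int(np.argmax(softmax(W_hy @ h)))   # greedy pick of next token
    if token == EOS:
        break
    output.append(token)
```

With trained weights the same loop produces the translated or generated sequence; here it only demonstrates the state-update and feedback mechanics.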
Decoder in the Transformer Architecture
The decoder based on the Transformer architecture also operates autoregressively, but its internal structure is different. It consists of a stack of identical layers, each containing three main sub-layers:
- Masked Multi-Head Self-Attention: This mechanism works similarly to the one in the encoder, but with one crucial difference: masking. The mask prevents each position from "attending" to subsequent positions in the sequence. This ensures that the prediction for position *i* depends only on the known outputs at positions less than *i*, preserving the autoregressive property.
- Multi-Head Cross-Attention: This is the key mechanism connecting the decoder to the encoder. Here, the queries (Query) come from the previous decoder layer, while the keys (Key) and values (Value) come from the encoder's output representations. This allows the decoder to focus on the most relevant parts of the input sequence at each generation step.
- Feed-Forward Network: A position-wise feed-forward network, identical in structure to the one used in the encoder.
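The causal masking in the first sub-layer can be shown concretely. This single-head sketch omits the learned projections and multi-head split of a real layer; it builds the attention score matrix, sets everything above the diagonal to negative infinity, and softmaxes, so each position receives zero weight from future positions.

```python
import numpy as np

def masked_self_attention_weights(X):
    """Attention weights for single-head causal self-attention
    (a sketch: real layers add learned Q/K/V projections and heads)."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)            # query-key similarities
    mask = np.triu(np.ones((T, T)), k=1)     # 1s strictly above diagonal
    scores = np.where(mask == 1, -np.inf, scores)  # hide future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

W = masked_self_attention_weights(
    np.random.default_rng(0).normal(size=(4, 8)))
print(np.triu(W, k=1))  # strictly upper triangle is all zeros
```

Because `exp(-inf)` is 0, the softmax redistributes all weight onto positions at or before the current one, which is exactly the autoregressive constraint.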
Types of Decoder-Based Models
Encoder-Decoder Models
This is a classic architecture where the encoder and decoder work as a pair.
- Operating Principle: The encoder creates a representation of the input data, and the decoder generates the output data using this representation.
- Examples: The original Transformer, T5, BART.
Decoder-Only Models
These models, which have become dominant in generative AI, exclusively use a stack of Transformer decoders.
- Operating Principle: In these models, there is no cross-attention to an encoder, as there is no encoder. The model operates purely in an autoregressive mode, predicting the next token based on all previous tokens in the same sequence. The input prompt and the already generated text are processed together.
- Applications: Ideal for tasks that require continuing a given text (text generation, dialogue systems, chatbots).
- Examples: The GPT series, LLaMA, Claude.
Connection with the Encoder
The interaction between the encoder and the decoder is a fundamental principle for transformation tasks.
- The Encoder compresses information about the input sequence into a set of vectors.
- The Decoder uses this set of vectors to sequentially generate a new sequence.
Cross-attention allows the decoder, at each generation step, to "consult" the encoder's outputs and decide which part of the input sequence to focus on at that moment. For example, when translating a sentence, a decoder generating a German word can "look" at the corresponding English word in the input.
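A sketch of that cross-attention step, with learned projection matrices omitted for brevity: queries come from the decoder's states, keys and values from the encoder's output, and each attention-weight row says how strongly one output position focuses on each input token.

```python
import numpy as np

def cross_attention(dec_states, enc_states):
    """Single-head cross-attention sketch (no learned projections):
    queries from the decoder, keys/values from the encoder."""
    d = dec_states.shape[-1]
    Q, K, V = dec_states, enc_states, enc_states
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # row i: how much decoder step i
                                         # attends to each input token
    return w @ V                         # input info blended per output step

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))   # 6 encoded input tokens (e.g. English)
dec = rng.normal(size=(2, 8))   # 2 decoder positions generated so far
out = cross_attention(dec, enc)
print(out.shape)  # (2, 8)
```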
See Also
- GPT
References
- Vaswani, A. et al. (2017). Attention Is All You Need. NIPS.