Encoder–decoder architecture
Encoder–decoder models are a class of neural network architectures designed for sequence-to-sequence (seq2seq) tasks. The architecture consists of two main components:
- Encoder: Processes the input sequence and compresses it into a compact numerical representation (a context vector or a sequence of hidden states).
- Decoder: Takes this representation and generates an output sequence based on it.
This architecture is fundamental to many tasks in Natural Language Processing (NLP) and computer vision, such as machine translation, text summarization, and image captioning.
Concept
The core idea of the encoder-decoder architecture is the separation of understanding and generation tasks. The encoder is responsible for understanding the input sequence, extracting all the necessary semantic information from it. The decoder is responsible for generating a new sequence using the information provided by the encoder.
This allows the model to handle input and output sequences of different lengths, which was a challenge for earlier architectures.
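As a minimal sketch of this interface (the function names and the "reverse" behavior are purely illustrative, not a real model), the key point is that `encode` produces a representation and `decode` generates from it, with the output length decoupled from the input length:

```python
def encode(tokens):
    """Compress the input sequence into a context representation.
    Here the 'representation' is just the token list itself -- a stand-in
    for the hidden states a real encoder would produce."""
    return list(tokens)

def decode(context, max_len=10):
    """Generate an output sequence from the context. This toy decoder
    emits the input reversed, stopping at max_len -- the point is only
    that input and output lengths are decoupled."""
    out = []
    for tok in reversed(context):
        if len(out) >= max_len:
            break
        out.append(tok)
    return out

print(decode(encode(["a", "b", "c"])))              # ['c', 'b', 'a']
print(decode(encode(["a", "b", "c", "d"]), max_len=2))  # ['d', 'c']
```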
Architectural Evolution
Early Models Based on RNN/LSTM
Initially, the encoder-decoder architecture was implemented using Recurrent Neural Networks (RNNs) or their gated variant, Long Short-Term Memory (LSTM) networks (Sutskever et al., 2014).
- Encoder: An RNN-based encoder processed the input sequence token by token and produced a single context vector: the hidden state after the last token, intended to summarize the entire sequence.
- Decoder: An RNN-based decoder was initialized with this context vector and generated the output sequence autoregressively.
The main drawback of this approach was the "bottleneck" problem: all information from the input sequence had to be compressed into a single fixed-length vector, leading to information loss, especially for long sequences.
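The bottleneck is easy to see in a minimal numpy sketch of a vanilla RNN encoder (untrained random weights; the sizes and names are arbitrary, chosen only for illustration): however long the input, the context is one fixed-length vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8  # embedding and hidden sizes (arbitrary for the sketch)

W_xh = rng.normal(scale=0.1, size=(d_h, d_in))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights

def encode(inputs):
    """Run a vanilla RNN over the inputs; return only the final hidden state."""
    h = np.zeros(d_h)
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h  # fixed-length context, regardless of input length

def decode_step(context, h_dec, y_prev):
    """One autoregressive decoder step, conditioned on the context vector."""
    return np.tanh(W_xh @ y_prev + W_hh @ h_dec + context)

seq = [rng.normal(size=d_in) for _ in range(20)]
context = encode(seq)              # 20 steps squeezed into one 8-dim vector
h_dec = decode_step(context, np.zeros(d_h), np.zeros(d_in))
print(context.shape)               # (8,)
```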
Introduction of the Attention Mechanism
A breakthrough occurred with the introduction of the attention mechanism (Bahdanau et al., 2014).
- How it works: Instead of relying on a single context vector, the decoder, at each generation step, "pays attention" to all of the encoder's hidden states. It calculates attention weights that indicate which parts of the input sequence are most relevant for generating the current output token.
- Advantages: This solved the bottleneck problem and enabled the model to work effectively with long sequences, significantly improving quality, especially in machine translation.
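A sketch of one attention step (dot-product scoring for brevity; Bahdanau et al. used an additive score, and the toy one-hot encoder states are chosen only to make the weights easy to read): at each decoder step, a softmax over scores against all encoder states yields the weights, and their weighted sum replaces the single context vector.

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Return attention weights over the input positions and the
    resulting context vector (a weighted sum of encoder states)."""
    scores = encoder_states @ decoder_state      # one score per input position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over positions
    context = weights @ encoder_states           # weighted sum of states
    return weights, context

enc = np.eye(3)                   # three toy encoder states (one-hot for clarity)
dec = np.array([10.0, 0.0, 0.0])  # decoder state most similar to state 0
w, ctx = attention(dec, enc)
print(w.round(3))                 # weight concentrated on position 0
```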
The Transformer Architecture
In 2017, the paper "Attention Is All You Need" introduced the Transformer architecture, which completely abandoned recurrence in favor of the attention mechanism.
- Transformer Encoder: Consists of a stack of layers, each using self-attention to create contextualized representations for every input token.
- Transformer Decoder: Consists of a stack of layers, each featuring two types of attention mechanisms:
  - Masked Self-Attention: Processes the already generated part of the output sequence.
  - Cross-Attention: Attends to the encoder's output representations.
This architecture became the standard for seq2seq tasks, owing to its strong performance and to the fact that, without recurrence, all positions can be processed in parallel during training.
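The causal (look-ahead) mask behind masked self-attention can be sketched in a few lines of numpy (uniform scores are used purely for illustration): position i is allowed to attend only to positions up to and including i.

```python
import numpy as np

def causal_mask(n):
    """Boolean mask: True where attention is allowed (lower triangle)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax with disallowed positions forced to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                     # uniform scores for illustration
probs = masked_softmax(scores, causal_mask(4))
print(probs[1])  # [0.5, 0.5, 0. , 0. ] -- step 1 sees only steps 0 and 1
```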
Applications
The encoder-decoder architecture is a standard for a wide range of tasks:
- Machine Translation: Translating a sentence from one language to another.
- Automatic Text Summarization: Creating a concise summary of a long document.
- Dialogue Systems: Generating a response to a user's query.
- Image Captioning: An encoder (often a Convolutional Neural Network) processes an image, and a decoder (often an RNN or a Transformer) generates a textual description.
- Speech Recognition: Converting an audio signal into a text transcript.
Key Models
- The Original Transformer (Vaswani et al., 2017): The model that introduced this architecture.
- BART (Bidirectional and Auto-Regressive Transformers): A model from Facebook that is pre-trained on the task of reconstructing "corrupted" text. The encoder is bidirectional (like in BERT), and the decoder is autoregressive (like in GPT).
- T5 (Text-to-Text Transfer Transformer): A model from Google that unifies NLP tasks by framing each of them as a text-to-text problem. T5 achieved strong results on benchmarks such as GLUE and SuperGLUE.
Comparison with Other Architectures
| Architecture | Primary Task | Components | Typical Models |
|---|---|---|---|
| Encoder-Decoder | Sequence-to-Sequence Transformation | Encoder + Decoder | T5, BART, the original Transformer |
| Encoder-Only | Text Understanding | Encoder Only | BERT, RoBERTa |
| Decoder-Only | Text Generation | Decoder Only | GPT |
References
- Bahdanau, D. et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
- Sutskever, I. et al. (2014). Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215.
- Luong, M.-T. et al. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.04025.
- Wu, Y. et al. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144.
- Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
- Britz, D. et al. (2017). Massive Exploration of Neural Machine Translation Architectures. arXiv:1703.03906.
- Lewis, M. et al. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv:1910.13461.
- Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683.
- Dong, L. et al. (2019). Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv:1905.03197.
- Zhang, J. et al. (2020). PEGASUS: Pre-training with Extracted Gap-Sentences for Abstractive Summarization. arXiv:1912.08777.