T5 (Text-to-Text Transfer Transformer)
T5 (Text-to-Text Transfer Transformer) is a family of large language models developed by researchers at Google and introduced in 2019[1]. Its key innovation is a unified "text-to-text" framework that treats every natural language processing (NLP) task as the conversion of one text sequence into another. This allows a single model, loss function, and training procedure to serve a wide range of tasks, such as translation, summarization, question answering, and classification[2].
The model is based on the standard encoder-decoder transformer architecture, which distinguishes it from models like BERT (encoder-only) and GPT (decoder-only). The work on T5 was conceived as a large-scale empirical study to systematically explore and compare various transfer learning techniques in NLP, rather than creating a fundamentally new method[1].
The "Text-to-Text" Paradigm
The central idea of T5 is that all tasks are formulated in a unified format. The model receives text as input and also generates text as output. To enable the model to distinguish between the tasks it is given, a special text instruction prefix is added to the input sequence[2].
- Translation: `translate English to German: That is good.` → `Das ist gut.`
- Sentiment Classification: `sst2 sentence: a very exciting film.` → `positive`
- Summarization: `summarize: [long article text]` → `[short summary]`
This approach radically simplifies applying the model: there is no need to design a separate task-specific output head for each task, as was characteristic of architectures like BERT[3].
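The prefix convention above amounts to plain string formatting. A minimal sketch (the prefix strings follow the examples given in the paper; treat the helper name as illustrative):

```python
def to_text_to_text(task_prefix: str, text: str) -> str:
    """Format an input for a T5-style text-to-text model: every task
    becomes plain text carrying a task-identifying prefix."""
    return f"{task_prefix}: {text}"

# Every task uses the same convention: text in, text out.
translation_input = to_text_to_text("translate English to German", "That is good.")
sentiment_input = to_text_to_text("sst2 sentence", "a very exciting film.")
```

Because tasks differ only in their prefix, the same model weights and decoding loop handle all of them.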
Architecture and Scaling
Encoder-Decoder Architecture
T5 uses a standard transformer architecture consisting of two parts[1]:
- Encoder: Processes the entire input sequence at once, producing a rich, contextualized representation. As in BERT, the T5 encoder attends bidirectionally.
- Decoder: Generates the output text token by token (autoregressively), using the representation provided by the encoder.
This hybrid structure allows T5 to effectively solve both language understanding and text generation tasks[4].
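The division of labor between the two halves can be sketched as a generic inference loop: the encoder runs once over the whole input, and the decoder then emits one token per step conditioned on the encoder output and everything generated so far. The toy `encode`/`decode_step` functions below are hypothetical stand-ins for a real model, used only to make the control flow concrete:

```python
def greedy_decode(encode, decode_step, input_tokens, eos, max_len=20):
    """Encoder-decoder inference sketch: one bidirectional encoder pass,
    then token-by-token (autoregressive) generation in the decoder."""
    memory = encode(input_tokens)            # runs exactly once
    output = []
    for _ in range(max_len):
        token = decode_step(memory, output)  # conditioned on memory + prefix
        if token == eos:
            break
        output.append(token)
    return output

# Toy stand-ins for a real model (for illustration only):
encode = lambda toks: sum(toks)              # "context" collapsed to a number
decode_step = lambda mem, out: mem - len(out) if mem - len(out) > 0 else 0
```

With these stubs, `greedy_decode(encode, decode_step, [1, 2], eos=0)` counts down from the encoded value until the stop token, mirroring how a real decoder halts on end-of-sequence.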
Key Improvements
The T5 architecture includes several changes compared to the original transformer model:
- Relative Positional Embeddings: Instead of absolute sinusoidal embeddings, T5 uses a simplified but effective form of relative position encoding, where a learnable scalar bias is added to the attention logits, depending only on the relative distance between tokens[1].
- Modified Layer Normalization (Layer Norm): Layer normalization is applied outside the residual path, before each sub-block, and is simplified to only rescale activations, with the additive bias removed, which improves training stability.
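The relative-position scheme can be sketched as a bucketing function: each relative distance between a query and a key is mapped to a bucket index, and the model learns one scalar bias per bucket per attention head. The version below is a simplified single-value sketch; the constants (32 buckets, maximum distance 128, exact buckets for small distances and log-spaced buckets for large ones) follow the paper's public implementation, but treat the details as an assumption:

```python
import math

def relative_position_bucket(rel_pos: int, num_buckets: int = 32, max_distance: int = 128) -> int:
    """Map a relative distance (key position minus query position) to a
    bucket index. Half the buckets encode direction; nearby distances get
    their own bucket, larger ones share logarithmically spaced buckets."""
    bucket = 0
    num_buckets //= 2            # split buckets between the two directions
    if rel_pos > 0:
        bucket += num_buckets    # positive offsets use the upper half
    n = abs(rel_pos)
    max_exact = num_buckets // 2
    if n < max_exact:
        return bucket + n        # one bucket per distance when close
    # Log-spaced buckets for distances in [max_exact, max_distance).
    val = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return bucket + min(val, num_buckets - 1)
```

Because distant positions share buckets, the bias table stays small and the scheme generalizes to sequence lengths not seen in training.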
Model Scales
In the original paper, the model was presented in several configurations with varying numbers of parameters, which allowed for a systematic study of the effect of scale[5]:
- T5-Small: ~60 million parameters
- T5-Base: ~220 million parameters
- T5-Large: ~770 million parameters
- T5-3B: ~3 billion parameters
- T5-11B: ~11 billion parameters
The study showed that increasing the model's scale is one of the most reliable ways to improve its performance[1].
Pre-training: C4 Dataset and the Span Corruption Task
The Span Corruption Task
For pre-training T5, a denoising objective was chosen, specifically a variant called span corruption[6]. The method works as follows:
- In the input text, 15% of the tokens are randomly masked.
- Unlike the MLM method in BERT, where individual tokens are masked, T5 masks entire contiguous spans of tokens.
- Each corrupted span is replaced by a single unique sentinel token (e.g., `<X>`, `<Y>`).
- The model is trained to generate, as its output, the dropped-out spans in order, each preceded by its corresponding sentinel token.
This approach forces the model to predict entire sequences of text, which proved to be a more effective pre-training task than simple language modeling[1].
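A minimal version of this corruption can be sketched in a few lines. The sentinel naming (`<extra_id_0>`, `<extra_id_1>`, ...) follows the public T5 vocabulary, and the span selection here is deliberately simplified: the spans are passed in explicitly rather than sampled with the paper's randomized lengths:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a unique sentinel token and
    collect the dropped spans, sentinel-prefixed, as the target sequence."""
    inputs, targets = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):      # spans sorted, non-overlapping
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[prev:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        prev = end
    inputs += tokens[prev:]
    return inputs, targets

tokens = ["Thank", "you", "for", "inviting", "me", "to", "your", "party", "last", "week"]
inp, tgt = span_corrupt(tokens, [(1, 2), (5, 7)])
# inp: ["Thank", "<extra_id_0>", "for", "inviting", "me", "<extra_id_1>", "party", "last", "week"]
# tgt: ["<extra_id_0>", "you", "<extra_id_1>", "to", "your"]
```

Note that the target is much shorter than the input, which makes each pre-training step cheaper than full-sequence reconstruction.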
The C4 Dataset (Colossal Clean Crawled Corpus)
To realize the potential of transfer learning, the researchers created a massive and high-quality cleaned text dataset called C4, with a size of about 750 GB[2]. It was derived from a large-scale cleaning and filtering of the publicly available Common Crawl web corpus[7]. The cleaning process included removing duplicates, boilerplate text ("Lorem ipsum"), incomplete sentences, and filtering out offensive language[8].
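The cleaning pipeline can be sketched as a pair of line-level and page-level filters. The heuristics below (require terminal punctuation, drop very short lines, drop placeholder boilerplate, discard pages containing code markers, keep only pages with several surviving lines) are in the spirit of the rules described for C4, but the exact thresholds and word lists here are illustrative:

```python
def keep_line(line):
    """Line-level C4-style filter: keep only sentence-like lines."""
    line = line.strip()
    if not line.endswith((".", "!", "?", '"')):
        return False                      # require terminal punctuation
    if len(line.split()) < 5:
        return False                      # drop very short lines
    if "lorem ipsum" in line.lower():
        return False                      # drop placeholder boilerplate
    return True

def clean_page(text, banned=("javascript",)):
    """Page-level filter: drop pages with banned words or code markers,
    then keep the page only if enough sentence-like lines survive."""
    if "{" in text or any(w in text.lower() for w in banned):
        return None
    kept = [line for line in text.splitlines() if keep_line(line)]
    return "\n".join(kept) if len(kept) >= 3 else None
```

Filters of this kind are cheap to run at web scale, which is precisely why their blind spots (discussed below) propagate so broadly.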
Criticism of the C4 Dataset
Despite the stated goal of creating a "clean" corpus, the C4 filtering process was criticized for systemic biases. Studies showed that the profanity filter disproportionately removed texts related to LGBTQ+ communities, as well as texts in African-American English (AAE)[8]. Additionally, a significant amount of offensive and copyrighted content was found in the dataset. These issues illustrate the difficulty of creating objectively "high-quality" datasets and how technical filtering decisions can lead to unintended social biases.
Results and Performance
At the time of its publication, T5 set new state-of-the-art performance records on numerous benchmarks, including GLUE, SuperGLUE, SQuAD, and summarization tasks[2]. In particular, the T5-11B model achieved a near-human level score on SuperGLUE, demonstrating its ability to handle tasks requiring complex logical reasoning[9]. These results confirmed the study's central hypothesis: the combination of a unified framework, massive scale, and a high-quality dataset is an extremely powerful strategy for achieving cutting-edge results in NLP.
Evolution and Variants of T5
The T5 approach served as the foundation for many subsequent models:
- mT5: A multilingual version of T5 trained on the mC4 corpus, covering 101 languages[10].
- ByT5: An experimental version that dispenses with subword tokenization entirely and operates directly on raw UTF-8 bytes, making it robust to typos and able to process any language "out of the box"[11].
- Switch Transformer: A scalable version of T5 that introduced the Mixture-of-Experts (MoE) architecture, allowing the number of parameters to be increased to trillions while maintaining reasonable computational costs[12].
- FLAN-T5: This is not a new architecture but a standard T5 that has undergone an additional fine-tuning step on hundreds of tasks formulated as instructions (instruction tuning). This significantly improved its ability to generalize to new, unseen tasks in a zero-shot setting (without examples)[13].
- UL2: A model that builds on the ideas of T5, using a new pre-training objective called Mixture of Denoisers, which combines various text masking schemes to improve versatility[14].
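The ByT5 idea in particular is simple enough to illustrate directly: with no learned vocabulary, each UTF-8 byte becomes one token id. The offset of 3 below reserves ids for pad/eos/unk, matching the public ByT5 vocabulary, though treat that detail as an assumption:

```python
def byt5_encode(text, offset=3):
    """Token-free encoding: token ids are just UTF-8 byte values, shifted
    past a few reserved special-token ids (pad=0, eos=1, unk=2)."""
    return [b + offset for b in text.encode("utf-8")]

def byt5_decode(ids, offset=3):
    """Invert the encoding back to text."""
    return bytes(i - offset for i in ids).decode("utf-8")
```

Since any string in any language is already a byte sequence, encoding never fails on unseen words, at the cost of longer input sequences than subword tokenization produces.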
References
Literature
- Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762.
- Raffel, C. et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683.
- Xue, L. et al. (2021). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. arXiv:2010.11934.
- Dodge, J. et al. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. arXiv:2104.08758.
- Fedus, W. et al. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
- Ni, J. et al. (2021). Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. arXiv:2108.08877.
- Guo, M. et al. (2021). LongT5: Efficient Text-To-Text Transformer for Long Sequences. arXiv:2112.07916.
- Xue, L. et al. (2022). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. arXiv:2105.13626.
- Tay, Y. et al. (2022). UL2: Unifying Language Learning Paradigms. arXiv:2205.05131.
- Chung, H. W. et al. (2022). Scaling Instruction-Finetuned Language Models. arXiv:2210.11416.
- Longpre, S. et al. (2023). The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. arXiv:2301.13688.
Notes
- 1. Raffel, Colin; Shazeer, Noam; Roberts, Adam; et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research.
- 2. "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer". Google Research Blog.
- 3. "A Detailed Look At Google's T5 Model in NLP". DhiWise Blog.
- 4. "T5 (Text-to-Text Transfer Transformer)". GeeksforGeeks.
- 5. "T5". Hugging Face Transformers Documentation.
- 6. "T5 (language model)". Wikipedia.
- 7. "C4 Dataset". Papers With Code.
- 8. Dodge, J.; Sap, M.; Marasović, A.; et al. "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus". arXiv.
- 9. "Google T5 algorithm scores 88.9 on SuperGLUE language benchmark, compared to 89.8 human baseline". Reddit, r/linguistics.
- 10. Xue, Linting; Constant, Noah; Roberts, Adam; et al. "mT5: A massively multilingual pre-trained text-to-text transformer". arXiv.
- 11. Xue, Linting; Barua, Aditya; Constant, Noah; et al. "ByT5: Towards a token-free future with pre-trained byte-to-byte models". arXiv.
- 12. Fedus, William; Zoph, Barret; Shazeer, Noam. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity". arXiv.
- 13. Chung, Hyung Won; et al. "Scaling Instruction-Finetuned Language Models". arXiv.
- 14. Tay, Yi; Dehghani, Mostafa; Tran, Vinh; et al. "UL2: Unifying Language Learning Paradigms". arXiv.