Pre-training of large language models
Pre-training of large language models (LLMs) is the foundational stage in their creation, in which a model is trained on vast and diverse collections of unlabeled text. This process allows the model to learn general linguistic patterns, world knowledge, and semantic relationships, producing a so-called foundation model that can then be adapted to solve specific tasks.
What is pre-training?
Pre-training is the initial training phase in which an LLM is trained on large-scale text datasets using self-supervised learning methods. This means that the training signals (labels) are generated from the data itself, without the need for manual human annotation.
The primary goal of this stage is to predict hidden or future parts of the text. Depending on the architecture, two main tasks are used:
- Causal Language Modeling (CLM): The model learns to predict the next word (token) in a sequence based on all preceding ones. This approach is the foundation of generative models like GPT.
- Masked Language Modeling (MLM): The model learns to restore randomly "masked" (hidden) words in the text by using the surrounding bidirectional context (words to the left and right). This method is used in models like BERT.
Through these tasks, the model is forced to learn syntax, semantics, and factual knowledge about the world to make successful predictions.
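The "labels generated from the data itself" idea can be sketched in a few lines of plain Python. The function names and the toy word sequence below are illustrative, not part of any real framework; real pipelines operate on subword token IDs rather than words:

```python
import random

def clm_examples(tokens):
    # Causal LM: the target at each position is simply the next token,
    # so labels come "for free" from the raw sequence.
    return list(zip(tokens[:-1], tokens[1:]))

def mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    # Masked LM: hide a random subset of tokens; the hidden originals
    # become the labels, again without any human annotation.
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)      # model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored in the loss
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(clm_examples(tokens))   # (input token, next-token target) pairs
print(mlm_example(tokens))    # masked inputs and their recovery targets
```

Both objectives turn raw text into (input, target) pairs mechanically, which is what makes training on web-scale corpora feasible.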
Data for pre-training
The effectiveness of pre-training depends heavily on the quality and diversity of the training data. The following main sources are used:
- Web pages: Datasets such as Common Crawl and C4 provide a wide range of topics, styles, and languages, representing a "snapshot" of the internet.
- Books: Corpora like BookCorpus and Project Gutenberg provide structured and coherent text, which is useful for understanding long-range dependencies and narratives.
- Conversational data: Data from forums (e.g., Reddit) and social networks, which helps models learn informal language and dialogue patterns.
- Specialized data: Scientific articles (from arXiv), source code (from GitHub and The Stack), or multilingual texts to enhance the model's specific capabilities.
Examples of data distribution
Different models use different ratios of sources, which affects their final abilities:
- GPT-3 (175B parameters): 16% books, 84% web pages.
- PaLM (540B parameters): roughly 50% conversational data, 28% web pages, 13% books, with the remainder code and other sources.
- LLaMA (65B parameters): roughly 87% web pages, 5% books, with the remainder code, scientific articles, and conversational data.
These distributions show that data selection is a strategic decision that varies depending on the model's objectives.
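A training pipeline typically interleaves sources according to such a mixture. The sketch below shows one simple way to draw a source in proportion to its weight; the weights are hypothetical, loosely echoing the GPT-3 ratios above, and `sample_source` is an illustrative helper rather than any library's API:

```python
import random

# Hypothetical mixture weights (must sum to 1.0); real pipelines tune
# these per model and often reweight individual corpora further.
mixture = {"web": 0.84, "books": 0.16}

def sample_source(rng, mixture):
    # Inverse-CDF sampling: walk the cumulative weights until the
    # uniform draw r falls inside a source's interval.
    r = rng.random()
    cum = 0.0
    for name, weight in mixture.items():
        cum += weight
        if r < cum:
            return name
    return name  # guard against floating-point rounding at the boundary

rng = random.Random(42)
counts = {"web": 0, "books": 0}
for _ in range(10_000):
    counts[sample_source(rng, mixture)] += 1
print(counts)  # web should dominate, near the 84/16 split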
Frequently used corpora
| Corpus | Size | Source | Last update |
|---|---|---|---|
| BookCorpus | 5GB | Books | Dec-2015 |
| Gutenberg | - | Books | Dec-2021 |
| C4 | 800GB | Common Crawl | Apr-2019 |
| CC-Stories-R | 31GB | Common Crawl | Sep-2019 |
| CC-NEWS | 78GB | Common Crawl | Feb-2019 |
| REALNEWS | 120GB | Common Crawl | Apr-2019 |
| OpenWebText | 38GB | Reddit links | Mar-2023 |
| Pushshift.io | 2TB | Reddit links | Mar-2023 |
| Wikipedia | 21GB | Wikipedia | Mar-2023 |
| The Pile | 800GB | Other | Dec-2020 |
| ROOTS | 1.6TB | Other | Jun-2022 |
Training techniques
Pre-training LLMs requires significant computational resources. The following techniques are used to manage this process:
- Distributed training: Using multiple GPUs or TPUs for parallel processing.
- Mixed Precision: Using lower-precision numerical formats (e.g., 16-bit instead of 32-bit) to speed up computations and reduce memory usage.
- Gradient Checkpointing: A technique to save memory by recomputing rather than storing some intermediate activations.
- Model Parallelism: Distributing the model itself across multiple devices.
Training a model like GPT-3 can take several months on thousands of GPUs.
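The mixed-precision idea from the list above can be illustrated with a minimal NumPy sketch (bookkeeping only, not a real training loop): weights are kept in a 32-bit "master" copy, a 16-bit copy halves the memory used by the forward and backward math, and loss scaling keeps small gradients from underflowing in 16-bit. The gradient value and learning rate here are arbitrary placeholders:

```python
import numpy as np

# fp32 "master" weights: the authoritative copy that gets updated.
master = np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)

half = master.astype(np.float16)      # fp16 working copy: half the memory
grad_half = np.full_like(half, 1e-3)  # pretend gradient from a backward pass

# Loss scaling: tiny fp16 gradients would underflow to zero, so they are
# scaled up before the backward pass and scaled back down before the
# fp32 update.
scale = 1024.0
scaled = grad_half * np.float16(scale)
master -= 0.01 * (scaled.astype(np.float32) / scale)  # update in fp32

print(half.nbytes, "bytes in fp16 vs", master.nbytes, "bytes in fp32")
```

In practice frameworks automate this pattern (e.g., automatic mixed precision in PyTorch), but the memory and bandwidth saving shown here is the core of the speedup.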
Scaling laws
Research, such as the work by OpenAI (Kaplan et al., 2020), has shown that the performance of language models improves predictably as three factors increase:
- Model size (number of parameters).
- Data volume.
- Computational resources.
These empirical relationships, known as scaling laws, guide developers in designing and training larger and more powerful models, allowing them to allocate their computational budget optimally.
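As a rough illustration, Kaplan et al. (2020) fit test loss as a power law in parameter count, L(N) = (N_c / N)^alpha. The constants below are approximate fits reported for decoder-only Transformers and should be read as illustrative of the trend, not as exact predictions:

```python
# Approximate constants from Kaplan et al. (2020) for the
# parameter-count scaling law (with data and compute unconstrained).
N_C = 8.8e13   # characteristic parameter count
ALPHA = 0.076  # power-law exponent

def predicted_loss(n_params):
    # Loss falls slowly but predictably as the model grows.
    return (N_C / n_params) ** ALPHA

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The shallow exponent is why each constant-factor improvement in loss requires an order-of-magnitude increase in scale, and why budget allocation across parameters, data, and compute matters so much.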
Challenges and achievements
- Scaling: The main achievement of pre-training is the ability to balance model size, data, and computation to achieve optimal performance.
- Data Quality: Ensuring the cleanliness, diversity, and absence of bias in training data is a key challenge.
- Efficiency: Developing methods to reduce computational costs, such as continual pre-training or more efficient architectures.
- Multilinguality: Creating models capable of effectively processing multiple languages requires careful data selection and balancing.
See also
- Large language models
- BERT
- GPT