Pre-training of large language models
Pre-training of large language models (LLMs) is the foundational stage in their creation, in which a model is trained on vast and diverse collections of unlabeled text. This process allows the model to learn general linguistic patterns, world knowledge, and semantic relationships, producing a so-called foundation model that can then be adapted to solve specific tasks.
What is pre-training?
Pre-training is the initial training phase in which an LLM is trained on large-scale text datasets using self-supervised learning methods. This means that the training signals (labels) are generated from the data itself, without the need for manual human annotation.
The primary goal of this stage is to predict hidden or future parts of the text. Depending on the architecture, two main tasks are used:
- Causal Language Modeling (CLM): The model learns to predict the next word (token) in a sequence based on all preceding ones. This approach is the foundation of generative models like GPT.
- Masked Language Modeling (MLM): The model learns to restore randomly "masked" (hidden) words in the text by using the surrounding bidirectional context (words to the left and right). This method is used in models like BERT.
Through these tasks, the model is forced to learn syntax, semantics, and factual knowledge about the world to make successful predictions.
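The "labels generated from the data itself" idea can be sketched in a few lines of plain Python. The function names and the toy word sequence below are illustrative, not part of any real framework; real pipelines operate on subword token IDs rather than words:

```python
import random

def clm_examples(tokens):
    # Causal LM: the target at each position is simply the next token,
    # so labels come "for free" from the raw sequence.
    return list(zip(tokens[:-1], tokens[1:]))

def mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    # Masked LM: hide a random subset of tokens; the hidden originals
    # become the labels, again without any human annotation.
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)      # model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored in the loss
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(clm_examples(tokens))   # (input token, next-token target) pairs
print(mlm_example(tokens))    # masked inputs and their recovery targets
```

Both objectives turn raw text into (input, target) pairs mechanically, which is what makes training on web-scale corpora feasible.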
Data for pre-training
The effectiveness of pre-training depends heavily on the quality and diversity of the training data. The following main sources are used:
- Web pages: Datasets such as Common Crawl and C4 provide a wide range of topics, styles, and languages, representing a "snapshot" of the internet.
- Books: Corpora like BookCorpus and Project Gutenberg provide structured and coherent text, which is useful for understanding long-range dependencies and narratives.
- Conversational data: Data from forums (e.g., Reddit) and social networks, which helps models learn informal language and dialogue patterns.
- Specialized data: Scientific articles (from arXiv), source code (from GitHub and The Stack), or multilingual texts to enhance the model's specific capabilities.
Examples of data distribution
Different models use different ratios of sources, which affects their final abilities:
- GPT-3 (175B parameters): 16% books, 84% web pages.
- PaLM (540B parameters): roughly 50% conversational data, 28% web pages, 13% books, with the remainder code and other sources.
- LLaMA (65B parameters): roughly 87% web pages, 5% books, with the remainder code, scientific articles, and conversational data.
These distributions show that data selection is a strategic decision that varies depending on the model's objectives.
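A training pipeline typically interleaves sources according to such a mixture. The sketch below shows one simple way to draw a source in proportion to its weight; the weights are hypothetical, loosely echoing the GPT-3 ratios above, and `sample_source` is an illustrative helper rather than any library's API:

```python
import random

# Hypothetical mixture weights (must sum to 1.0); real pipelines tune
# these per model and often reweight individual corpora further.
mixture = {"web": 0.84, "books": 0.16}

def sample_source(rng, mixture):
    # Inverse-CDF sampling: walk the cumulative weights until the
    # uniform draw r falls inside a source's interval.
    r = rng.random()
    cum = 0.0
    for name, weight in mixture.items():
        cum += weight
        if r < cum:
            return name
    return name  # guard against floating-point rounding at the boundary

rng = random.Random(42)
counts = {"web": 0, "books": 0}
for _ in range(10_000):
    counts[sample_source(rng, mixture)] += 1
print(counts)  # web should dominate, near the 84/16 split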
Frequently used corpora
| Corpus | Size | Source | Last update |
|---|---|---|---|
| BookCorpus | 5GB | Books | Dec-2015 |
| Gutenberg | - | Books | Dec-2021 |
| C4 | 800GB | Common Crawl | Apr-2019 |
| CC-Stories-R | 31GB | Common Crawl | Sep-2019 |
| CC-NEWS | 78GB | Common Crawl | Feb-2019 |
| REALNEWS | 120GB | Common Crawl | Apr-2019 |
| OpenWebText | 38GB | Reddit links | Mar-2023 |
| Pushshift.io | 2TB | Reddit links | Mar-2023 |
| Wikipedia | 21GB | Wikipedia | Mar-2023 |
| The Pile | 800GB | Other | Dec-2020 |
| ROOTS | 1.6TB | Other | Jun-2022 |
Training techniques
Pre-training LLMs requires significant computational resources. The following techniques are used to manage this process:
- Distributed training: Using multiple GPUs or TPUs for parallel processing.
- Mixed Precision: Using lower-precision numerical formats (e.g., 16-bit instead of 32-bit) to speed up computations and reduce memory usage.
- Gradient Checkpointing: A technique to save memory by recomputing rather than storing some intermediate activations.
- Model Parallelism: Distributing the model itself across multiple devices.
Training a model like GPT-3 can take several months on thousands of GPUs.
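The mixed-precision idea from the list above can be illustrated with a minimal NumPy sketch (bookkeeping only, not a real training loop): weights are kept in a 32-bit "master" copy, a 16-bit copy halves the memory used by the forward and backward math, and loss scaling keeps small gradients from underflowing in 16-bit. The gradient value and learning rate here are arbitrary placeholders:

```python
import numpy as np

# fp32 "master" weights: the authoritative copy that gets updated.
master = np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)

half = master.astype(np.float16)      # fp16 working copy: half the memory
grad_half = np.full_like(half, 1e-3)  # pretend gradient from a backward pass

# Loss scaling: tiny fp16 gradients would underflow to zero, so they are
# scaled up before the backward pass and scaled back down before the
# fp32 update.
scale = 1024.0
scaled = grad_half * np.float16(scale)
master -= 0.01 * (scaled.astype(np.float32) / scale)  # update in fp32

print(half.nbytes, "bytes in fp16 vs", master.nbytes, "bytes in fp32")
```

In practice frameworks automate this pattern (e.g., automatic mixed precision in PyTorch), but the memory and bandwidth saving shown here is the core of the speedup.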
Scaling laws
Research, such as the work by OpenAI (Kaplan et al., 2020), has shown that the performance of language models improves predictably as three factors increase:
- Model size (number of parameters).
- Data volume.
- Computational resources.
These empirical relationships, known as scaling laws, guide developers in designing and training larger and more powerful models, allowing them to allocate their computational budget optimally.
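As a rough illustration, Kaplan et al. (2020) fit test loss as a power law in parameter count, L(N) = (N_c / N)^alpha. The constants below are approximate fits reported for decoder-only Transformers and should be read as illustrative of the trend, not as exact predictions:

```python
# Approximate constants from Kaplan et al. (2020) for the
# parameter-count scaling law (with data and compute unconstrained).
N_C = 8.8e13   # characteristic parameter count
ALPHA = 0.076  # power-law exponent

def predicted_loss(n_params):
    # Loss falls slowly but predictably as the model grows.
    return (N_C / n_params) ** ALPHA

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```

The shallow exponent is why each constant-factor improvement in loss requires an order-of-magnitude increase in scale, and why budget allocation across parameters, data, and compute matters so much.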
Challenges and achievements
- Scaling: The main achievement of pre-training is the ability to balance model size, data, and computation to achieve optimal performance.
- Data Quality: Ensuring the cleanliness, diversity, and absence of bias in training data is a key challenge.
- Efficiency: Developing methods to reduce computational costs, such as continual pre-training or more efficient architectures.
- Multilinguality: Creating models capable of effectively processing multiple languages requires careful data selection and balancing.
See also
- Large language models
- BERT
- GPT