BLOOM (language model)
BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is an open-access large language model (LLM) with 176 billion parameters. It was developed in 2022 as part of the BigScience project, an international collaboration of over 1,000 researchers from 70 countries, spearheaded by Hugging Face[1].
BLOOM is an autoregressive transformer model capable of generating coherent text in 46 natural languages and 13 programming languages. The model was trained on the Jean Zay supercomputer in France and became one of the first truly open-access alternatives to proprietary models such as OpenAI's GPT-3[2].
Background and Development
The BigScience initiative was launched in May 2021 with the goal of democratizing AI research by collaboratively creating a large, open-source language model[1]. At that time, state-of-the-art LLMs such as GPT-3 were developed behind closed doors by large corporations that did not release their training data, source code, or model weights. The BigScience project brought together over a thousand volunteer researchers from around the world to create a competitive and fully open model.
The project received a grant for computing resources on the French supercomputer Jean Zay (IDRIS/CNRS). The model's training took place from March 11 to July 6, 2022[3]. Development was conducted transparently: the team published its data-selection criteria and training configurations and held public discussions, following the project's ethical charter.
Architecture and Training
Model Architecture
BLOOM is built on a decoder-only autoregressive transformer architecture, similar to the GPT-3 model[2].
| Parameter | Value |
|---|---|
| Type | Decoder-only transformer |
| Parameters | 176,247,271,424 |
| Layers | 70 |
| Attention heads | 112 |
| Hidden size | 14,336 |
| Sequence length | 2048 tokens |
| Activation function | GELU |
| Positional encoding | ALiBi |
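The figures in the table are mutually consistent: a quick arithmetic check reproduces the exact parameter count. The vocabulary size (250,880 embedding rows, padded for tensor parallelism) and the inclusion of bias and LayerNorm terms are assumptions taken from the model card rather than the table above.

```python
# Back-of-the-envelope reconstruction of BLOOM's parameter count from the
# architecture table. The 250,880-entry (padded) vocabulary is an assumption
# not listed in the table; embeddings are tied with the output head.
hidden = 14336
layers = 70
vocab = 250880

embedding = vocab * hidden                    # word embeddings (tied with output)
emb_ln = 2 * hidden                           # embedding LayerNorm (weight + bias)

qkv = hidden * 3 * hidden + 3 * hidden        # fused query/key/value projection
attn_out = hidden * hidden + hidden           # attention output projection
mlp_up = hidden * 4 * hidden + 4 * hidden     # feed-forward expansion (4x)
mlp_down = 4 * hidden * hidden + hidden       # feed-forward contraction
layer_norms = 2 * (2 * hidden)                # two LayerNorms per block
per_layer = qkv + attn_out + mlp_up + mlp_down + layer_norms

final_ln = 2 * hidden
total = embedding + emb_ln + layers * per_layer + final_ln
print(total)  # 176247271424 -- matches the table exactly
```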
The model was implemented using the Megatron-LM and DeepSpeed frameworks, developed by Nvidia and Microsoft, respectively, with a number of modifications for efficient distributed training[5].
Training Data
BLOOM was trained on the specially created ROOTS (The Responsible Open-science Open-collaboration Text Sources) text corpus. The total data volume was 1.6 terabytes of cleaned and deduplicated text (≈366 billion tokens)[6].
The corpus includes texts in 59 languages:
- 46 natural languages, including English (30% of tokens), Chinese, French, Spanish, Arabic, as well as many low-resource languages (e.g., Chi Tumbuka — 0.00002% of tokens).
- 13 programming languages, including Python, Java, JavaScript, and C++.
This multilingual and multi-domain dataset was intentionally compiled to make the model useful for a wide range of language communities.
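The percentage shares above translate into very different absolute scales. A quick back-of-the-envelope conversion (using the ≈366 billion token total stated earlier; the exact counts are illustrative):

```python
# Rough absolute token counts implied by the ROOTS language shares.
total_tokens = 366e9  # approximate size of the ROOTS corpus in tokens

def tokens_for(share_percent):
    """Convert a percentage share of the corpus into an absolute token count."""
    return total_tokens * share_percent / 100

print(f"English (30%):           {tokens_for(30):.3g} tokens")       # ~1.1e11
print(f"Chi Tumbuka (0.00002%):  {tokens_for(0.00002):.3g} tokens")  # ~7.3e4
```

The gap of six orders of magnitude between high- and low-resource languages is one reason multilingual evaluation of such models remains difficult.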
Performance and Application
BLOOM demonstrates competitive results on various benchmarks, comparable to models of a similar size, such as OPT-175B from Meta, despite its multilingual nature[2].
The model is capable of performing a wide range of tasks in a zero-shot setting (without additional training), including:
- Generating text in a given style.
- Summarizing documents.
- Answering questions based on context.
- Translating between languages.
- Generating simple program code.
To improve its practical utility, the BigScience team later performed multitask instruction fine-tuning, producing BLOOMZ, a variant that follows user instructions more reliably.
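These capabilities can be tried locally through the Hugging Face `transformers` library. The sketch below uses `bigscience/bloom-560m`, one of the smaller checkpoints released alongside the full model; the prompt and generation settings are illustrative, and the 176-billion-parameter model itself requires multi-GPU hardware.

```python
# Zero-shot prompting with a small BLOOM checkpoint via transformers.
# bigscience/bloom-560m is a smaller sibling of the 176B model that runs
# on a single CPU/GPU; the prompt and settings here are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

prompt = "Translate to French: 'The weather is nice today.' ->"
result = generator(prompt, max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])
```

With `do_sample=False` the model decodes greedily, so the continuation is deterministic for a given checkpoint.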
Licensing and Open Access
The full 176-billion-parameter BLOOM model, its source code, and data were released in July 2022. The model is distributed under the specially developed RAIL (Responsible AI License) v1.0[7].
This license permits free use and modification of the model but imposes a series of restrictions on its application in certain areas. In particular, it is forbidden to use BLOOM for purposes that contradict the BigScience ethical norms, such as:
- Mass surveillance.
- Algorithmic discrimination.
- Spreading disinformation.
- Controlling lethal autonomous weapons.
BLOOM became the first major AI model released under a license with explicit clauses about responsible use[8].
Literature
- Hendrycks, D.; Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
- Shoeybi, M.; et al. (2019). Megatron‑LM: Training Multi‑Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.
- Rajbhandari, S.; et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv:1910.02054.
- Press, O.; et al. (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv:2108.12409.
- Le Scao, T.; et al. (2022). BLOOM: A 176B‑Parameter Open‑Access Multilingual Language Model. arXiv:2211.05100.
- Muennighoff, N.; et al. (2022). Crosslingual Generalization through Multitask Finetuning. arXiv:2211.01786.
- BigScience Workshop (2022). BigScience OpenRAIL‑M License v1.0. Online specification.
- Akiki, C.; et al. (2022). BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model. arXiv:2212.04960.
- Yong, Z.‑X.; et al. (2022). BLOOM+1: Adding Language Support to BLOOM for Zero‑Shot Prompting. arXiv:2212.09535.
- Laurençon, H.; et al. (2023). The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset. arXiv:2303.03915.
Notes
- [1] "BLOOM". BigScience Blog.
- [2] Le Scao, T.; et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model". arXiv:2211.05100.
- [3] "Researchers open-source neural network with 176B parameters". SiliconANGLE.
- [4] "bigscience/bloom". Hugging Face.
- [5] "The Technology Behind BLOOM Training". Hugging Face Blog.
- [6] Laurençon, H.; et al. (2023). "The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset". arXiv:2303.03915.
- [7] "BigScience OpenRAIL-M". BigScience Blog.
- [8] Heikkilä, M. "BLOOM is the first AI model to be under a...". X.