PaLM (Pathways Language Model)

From Systems Analysis Wiki

PaLM (Pathways Language Model) is a family of large language models (LLMs) developed by Google. The first version of the model, introduced in April 2022, contained 540 billion parameters and became one of the largest language models in the world at the time, demonstrating breakthrough capabilities that resulted from massive scaling[1].

The key technological foundation of PaLM was Pathways, a new machine learning systems architecture from Google that enables the efficient coordination of distributed computations across thousands of accelerator chips[2]. PaLM was the first large-scale demonstration of this system, showcasing unprecedented training efficiency at an immense scale.

Pathways System: The Foundation for Scaling

The Pathways concept, introduced by Google in 2021, envisioned a single neural network capable of efficiently generalizing knowledge across different domains and performing thousands of tasks simultaneously. PaLM became the first large-scale application of this system: its training was parallelized across 6,144 TPU v4 chips, organized into two cloud clusters (TPU v4 Pods)[1].

At the time of its creation, this was the largest TPU configuration ever used to train a single model. The system achieved a record hardware FLOPs utilization of 57.8%, which made it possible to significantly surpass previous projects in scale and successfully train a model with over half a trillion parameters[3].

Architecture and Training Data

Model Architecture

PaLM is a dense (non-sparse) language model with a decoder-only architecture, similar to models in the GPT series. This architecture is oriented towards next-token prediction tasks and is well-suited for text generation. PaLM modifies the standard transformer architecture in several ways to enhance efficiency, including[1]:

  • Parallel Layers: The attention and feed-forward layers are computed in parallel, which accelerated training by approximately 15%.
  • SwiGLU Activation: Use of the SwiGLU activation function instead of the standard ReLU, which significantly improved model quality.
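The "parallel layers" change above can be sketched in a few lines: instead of running the MLP on the output of the attention sublayer, both sublayers read the same normalized input and their results are summed into the residual stream, which lets their matrix multiplies be fused on the accelerator. The toy blocks below use NumPy with made-up single-matrix sublayers (real models use learned multi-head projections and a SwiGLU down-projection); they illustrate the dataflow only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))           # 4 toy token vectors
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def layer_norm(h):
    return (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-5)

def mlp(h):
    # SwiGLU-style gating collapsed to one matrix per branch for brevity:
    # swish(h @ W1) * (h @ W2)
    a = h @ W1
    return (a * (1.0 / (1.0 + np.exp(-a)))) * (h @ W2)

def attention(h):
    # Single-head self-attention with Q = K = V = h, again for brevity.
    scores = h @ h.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return weights @ h

def serial_block(h):
    # Standard pre-LN transformer block: attention, then MLP.
    h = h + attention(layer_norm(h))
    return h + mlp(layer_norm(h))

def parallel_block(h):
    # PaLM-style parallel block: both sublayers see the same input.
    n = layer_norm(h)
    return h + attention(n) + mlp(n)
```

The parallel form changes the function the block computes slightly, but at PaLM's scale this had negligible quality impact while speeding up training.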

Training Data

PaLM was trained on a high-quality data corpus of 780 billion tokens. The dataset was multilingual and diverse, including[1]:

  • High-quality web documents and books.
  • Articles from Wikipedia.
  • Dialogues from social media (50% of the corpus).
  • Source code from GitHub (5% of the corpus).

Approximately 78% of the data was in English, while the remaining 22% was a multilingual set. A special "lossless" tokenization method was used, which preserved all whitespace (critical for code) and split unrecognized Unicode characters into bytes.
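The byte-fallback idea behind this lossless tokenization can be shown with a toy character-level tokenizer: characters in the vocabulary pass through verbatim (whitespace included, which matters for source code), and anything else is decomposed into its UTF-8 bytes, so decoding can always reconstruct the exact input. The vocabulary and token format below are invented for illustration; PaLM's actual tokenizer is a SentencePiece model.

```python
KNOWN = set("abcdefghijklmnopqrstuvwxyz ()\n\t")  # toy character vocabulary

def tokenize(text):
    """Lossless toy tokenizer: unknown characters fall back to byte tokens."""
    tokens = []
    for ch in text:
        if ch in KNOWN:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

def detokenize(tokens):
    out, buf = [], []
    for t in tokens:
        if t.startswith("<0x"):
            buf.append(int(t[3:5], 16))        # collect raw bytes
        else:
            if buf:
                out.append(bytes(buf).decode("utf-8"))
                buf = []
            out.append(t)
    if buf:
        out.append(bytes(buf).decode("utf-8"))
    return "".join(out)

text = "def f():\n\treturn π"
assert detokenize(tokenize(text)) == text      # round-trip is lossless
```

Because every character either survives verbatim or is recoverable from its bytes, no input is ever collapsed into a lossy unknown-token placeholder.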

Capabilities and Results

Emergent Abilities and Few-Shot Learning

PaLM demonstrated that increasing the scale of the model, the volume of data, and the computational power can lead to emergent (unexpectedly arising) abilities. On many tasks, the model's performance increased sharply and non-linearly only at the largest scale, indicating the appearance of new, previously unobserved capabilities[3].

The model was evaluated in few-shot learning mode (without fine-tuning, with a few examples in the prompt) and surpassed previous large models (such as GPT-3 and LaMDA) on 28 out of 29 popular NLP benchmarks. On the comprehensive BIG-bench suite of tasks, PaLM became the first model whose results surpassed the average level demonstrated by human testers[1].

Chain-of-Thought Reasoning

One of PaLM's most notable achievements was its ability for multi-step logical reasoning when using the "chain-of-thought prompting" technique[1]. This method involves providing the model with examples where the solution to a problem is broken down into steps. After learning from such examples, PaLM was able to generate its own "chain of thought" to solve new complex tasks, such as:

  • Mathematical Problems: On the GSM8K test (grade school-level math problems), PaLM solved 58% of the tasks, surpassing the previous state-of-the-art result achieved by a fine-tuned model.
  • Common Sense Tasks: The model was able to generate detailed explanations for non-trivial problems, for example, providing interpretations of previously unseen jokes.

This capability made the model's "thinking" process more transparent and human-like.
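The technique above is purely a matter of prompt construction. The sketch below shows its shape: the first exemplar is the canonical tennis-ball example from Wei et al. (2022), while the second question is invented for illustration and is not drawn from GSM8K.

```python
# A few-shot chain-of-thought prompt: the worked exemplar shows the
# model the step-by-step format it should imitate for the new question.
FEW_SHOT_COT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: A baker made 24 muffins and sold 9. Then she baked 12 more.
How many muffins does she have now?
A:"""

# The model is expected to continue with intermediate steps, e.g.
# "24 - 9 = 15 muffins remain. 15 + 12 = 27. The answer is 27."
print(FEW_SHOT_COT)
```

Contrast this with a standard few-shot prompt, whose exemplar would jump straight to "The answer is 11"; eliciting the intermediate steps is what produced the large gains on GSM8K.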

Code Generation and Multilingualism

Despite source code constituting only 5% of the training data, PaLM demonstrated performance comparable to the specialized OpenAI Codex model on code generation and transformation tasks. The model also showed strong capabilities in multilingual tasks, including translation[3].

Evolution and Successors: The PaLM Family

PaLM became the foundation for an entire family of models developed by Google.

PaLM 2

Introduced in May 2023, PaLM 2 became a more efficient and multilingual successor. Instead of pursuing a higher parameter count, the focus shifted to the quality of training data and architectural efficiency. PaLM 2 is trained on texts in over 100 languages and demonstrates improved capabilities in logic, programming, and translation[4]. The model is released in four sizes (from smallest to largest): Gecko, Otter, Bison, and Unicorn. The most compact version (Gecko) is lightweight enough to run on mobile devices offline.

Specialized Versions

Based on PaLM and PaLM 2, versions for specific domains were created:

  • Med-PaLM 2: A specialized model for medicine. It became the first AI system to achieve an expert level on questions from the US Medical Licensing Exam (USMLE)[4].
  • Sec-PaLM 2: A model focused on cybersecurity, trained to identify vulnerabilities and analyze malicious code[5].

PaLM-E: Multimodal Version

PaLM-E (Pathways Language Model Embodied) is a multimodal model that combines the PaLM language model with visual data from a Vision Transformer (ViT). This allows the model to process both text and images, solving tasks related to the physical world, such as controlling robots[6].

Ethical Aspects and Limitations

The creators of PaLM emphasize the need for a responsible approach to developing large language models. The official scientific paper included an analysis of potential biases and toxicity in the generated text. To ensure transparency, Google published a Model Card and a Datasheet for PaLM, documenting the dataset's characteristics, testing results, and identified limitations[1]. These measures align with modern practices for responsible AI and are intended to mitigate risks associated with biases and the generation of harmful content.

Further Reading

  • Chowdhery, A. et al. (2022). PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311.
  • Anil, R. et al. (2023). PaLM 2 Technical Report. arXiv:2305.10403.
  • Driess, D. et al. (2023). PaLM-E: An Embodied Multimodal Language Model. arXiv:2303.03378.
  • Singhal, K. et al. (2022). Large Language Models Encode Clinical Knowledge. arXiv:2212.13138.
  • Singhal, K. et al. (2023). Towards Expert-Level Medical Question Answering with Large Language Models. arXiv:2305.09617.
  • Barham, P. et al. (2022). Pathways: Asynchronous Distributed Dataflow for ML. arXiv:2203.12533.
  • Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
  • Zhang, Z. et al. (2022). Automatic Chain of Thought Prompting in Large Language Models. arXiv:2210.03493.
  • Wei, J. et al. (2022). Emergent Abilities of Large Language Models. arXiv:2206.07682.
  • Schaeffer, R. et al. (2023). Are Emergent Abilities of Large Language Models a Mirage?. arXiv:2304.15004.
  • Lu, S. et al. (2023). Are Emergent Abilities in Large Language Models just In-Context Learning?. arXiv:2309.01809.
  • Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
  • Rae, J. W. et al. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv:2112.11446.
  • Diao, S. et al. (2023). Active Prompting with Chain-of-Thought for Large Language Models. arXiv:2302.12246.

References

  1. Chowdhery, Aakanksha; Narang, Sharan; Devlin, Jacob; et al. (2022). "PaLM: Scaling Language Modeling with Pathways". arXiv:2204.02311.
  2. "Introducing Pathways: A next-generation AI architecture". Google AI Blog.
  3. "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance". Google Research Blog.
  4. "Google AI: What to know about the PaLM 2 large language model". Google AI Blog.
  5. "New AI capabilities that can help address your security challenges". Google Cloud Blog.
  6. "PaLM-E: An embodied multimodal language model". Google Research Blog.