Gemma (Google)

From Systems Analysis Wiki

Gemma is a family of freely available language models developed and released by Google (specifically, the Google DeepMind division). The Gemma models are based on the same research and technology as the flagship Gemini family and are positioned as their lightweight, high-performance versions[1]. The name comes from the Latin word gemma, meaning "gemstone"[2].

Gemma belongs to the category of open models: Google releases the model weights, allowing researchers and developers to freely use, fine-tune, and distribute them, including for commercial projects, provided they adhere to the terms of responsible use[2]. This is a key difference from the Gemini models, which are only accessible via cloud APIs. Gemma models can run locally on consumer hardware (laptops, desktops with GPUs), not just in data centers[3].

Development and Releases

The Gemma family includes several generations of models, each introducing improvements in architecture, performance, and capabilities.

First Generation: Gemma 1

The first version of Gemma was released on February 21, 2024[4]. It included two text models based on a decoder-only transformer architecture:

  • Gemma 2B (2 billion parameters)
  • Gemma 7B (7 billion parameters)

At the time of their release, Google claimed that these models outperformed significantly larger counterparts on key benchmarks[2]. The initial models were predominantly English-language but were trained on a diverse range of data, including web documents, software code, and mathematical problems[1]. Both models were released in two variants: a base (pre-trained) version and an instruction-tuned version for better adherence to user commands[2].

Second Generation: Gemma 2

Gemma 2 was announced on June 27, 2024, and brought significant improvements[1].

  • Model Sizes: Models with 9 and 27 billion parameters were released at launch, with a 2-billion-parameter variant following shortly afterward. The smaller models were trained using knowledge distillation from a larger teacher model to enhance their quality[5].
  • Context Window: Remained at 8,192 tokens, as in the first generation; efficient handling of long inputs was instead addressed through the attention changes below[7].
  • Architectural Improvements: Mechanisms such as grouped-query attention and an alternating scheme of local and global attention were introduced for efficient processing of long contexts[1].
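The distillation objective mentioned above can be sketched as follows. This is a generic soft-target distillation loss in NumPy, not Gemma 2's actual training code; the shapes, temperature, and toy logits are invented for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) averaged over sequence positions.

    student_logits, teacher_logits: arrays of shape (seq_len, vocab_size).
    The student is trained to match the teacher's full next-token
    distribution (soft targets) rather than only the one-hot gold token.
    """
    t = softmax(teacher_logits / temperature)
    log_s = np.log(softmax(student_logits / temperature) + 1e-12)
    log_t = np.log(t + 1e-12)
    kl = (t * (log_t - log_s)).sum(axis=-1)   # per-position divergence
    return kl.mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 16))            # logits: 4 positions, 16-token vocab
student_close = teacher + 0.01 * rng.normal(size=(4, 16))
student_far = rng.normal(size=(4, 16))
# A student that tracks the teacher's distribution incurs a much smaller loss.
loss_close = distillation_loss(student_close, teacher)
loss_far = distillation_loss(student_far, teacher)
```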

Third Generation: Gemma 3

Gemma 3 was introduced in March 2025 as the next step in the family's development, with a focus on multimodality and an expanded range of tasks[6].

  • Multimodality: The models gained support for images and video as input, alongside text.
  • Sizes and Languages: The model lineup covers four sizes (1B, 4B, 12B, 27B) and supports over 140 languages; the smallest 1B model is text-only[6].
  • Context Window: Increased to 128,000 tokens for the 4B, 12B, and 27B models (32,000 for the 1B model)[6].

According to Google, Gemma 3 27B demonstrated performance on par with the leading open models of its time, ranking behind only reasoning-focused models such as DeepSeek-R1 in preference-based comparisons[6].

Architecture and Technical Features

Gemma models are based on a decoder-only transformer architecture, similar to GPT models[7]. This means the model generates text autoregressively, predicting the next token based on all previous ones. Key technical solutions include:

  • Rotary Position Embeddings (RoPE): Instead of absolute positional embeddings, RoPE is used to efficiently encode positional information.
  • Multi-query and Grouped-query Attention: To accelerate processing and save memory, smaller models (such as Gemma 2B) use multi-query attention, in which all attention heads share a single key/value head. Gemma 2 introduced grouped-query attention, in which query heads are divided into groups that each share one key/value head, offering a compromise between the speed of multi-query attention and the quality of full multi-head attention[1][7].
  • Alternating Attention Scheme: Gemma 2 implements a scheme where layers with global self-attention alternate with layers using a limited "sliding window" attention, enabling efficient processing of long contexts[1].
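The grouped-query and sliding-window mechanisms above can be sketched in a few lines of NumPy. This is an illustrative toy, not Gemma's actual implementation; the head counts, dimensions, and data are invented for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, num_kv_heads, window=None):
    """Causal attention with grouped query heads and an optional sliding window.

    q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d).
    Each group of num_q_heads // num_kv_heads query heads shares one K/V head.
    window=None mimics a global layer; an integer limits each position to the
    last `window` tokens, as in Gemma 2's local layers.
    """
    num_q_heads, seq, d = q.shape
    group = num_q_heads // num_kv_heads
    pos = np.arange(seq)
    mask = pos[None, :] <= pos[:, None]               # causal: only the past
    if window is not None:
        mask &= pos[None, :] > pos[:, None] - window  # ...and only the nearby past
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                               # shared K/V head for this group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores = np.where(mask, scores, -np.inf)
        out[h] = softmax(scores) @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))      # 8 query heads
k = rng.normal(size=(2, 5, 16))      # but only 2 key/value heads (GQA)
v = rng.normal(size=(2, 5, 16))
global_out = attention(q, k, v, num_kv_heads=2)           # global attention layer
local_out = attention(q, k, v, num_kv_heads=2, window=2)  # sliding-window layer
```

With a window of 1, each position can attend only to itself, so the output of every query head reduces to the value vector of its shared K/V head; larger windows interpolate toward global causal attention.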

Model Family and Variants

In addition to the general-purpose base models, Google has released several derivative versions of Gemma optimized for specific tasks.

  • CodeGemma: A model for generating and completing code, supporting C++, C#, Go, Java, JavaScript, Python, Rust, and other languages[1].
  • DataGemma: A family of models fine-tuned for integration with external data using RAG techniques. The model can execute search queries against databases (e.g., Google Data Commons) to improve the factual accuracy of its responses[1].
  • PaliGemma: A multimodal model capable of accepting images and text as input. It is designed for visual question-answering tasks, such as image captioning and object recognition[1].
  • RecurrentGemma: An experimental variant with a hybrid Griffin architecture, combining local attention and linear recurrent connections. Because its recurrent state stays a fixed size regardless of sequence length, it significantly speeds up the generation of long sequences and reduces memory use[7].
  • MedGemma: A specialized version of Gemma 3 for the medical domain. It includes multimodal (4B) and text-only (27B) models for analyzing medical images (X-rays, scans) and clinical documents. The models are distributed as open models but are not intended for direct clinical use without further validation[8].
  • DolphinGemma: A research project applying Gemma technologies to decipher dolphin communication. The model is trained on years of audio recordings and is used to identify patterns in animal language[9].
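The retrieval-augmented pattern behind DataGemma can be illustrated with a toy sketch: fetch relevant facts from an external store and prepend them to the prompt so the model grounds its answer. The corpus, overlap scorer, and prompt template below are all invented for illustration and bear no relation to the actual Data Commons API.

```python
import re

def _words(text):
    """Lowercase word tokens; a deliberately naive tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, top_k=2):
    """Rank documents by word overlap with the query (toy scorer)."""
    q = _words(query)
    return sorted(corpus, key=lambda doc: len(q & _words(doc)), reverse=True)[:top_k]

def build_prompt(query, corpus):
    """Prepend retrieved facts so the model grounds its answer in them."""
    context = "\n".join(f"- {fact}" for fact in retrieve(query, corpus))
    return f"Use only these facts:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The population of Kenya was about 54 million in 2023.",
    "Mount Kenya is the second-highest peak in Africa.",
    "The capital of Kenya is Nairobi.",
]
prompt = build_prompt("What is the population of Kenya?", corpus)
```

A production system replaces the toy scorer with real search over a structured source, but the shape of the flow (retrieve, then condition the model on the results) is the same.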

Availability and Application

Gemma models are available on the Kaggle and Hugging Face platforms, and are also integrated into Google Colab and the Vertex AI Model Garden[2]. To accelerate inference, Google collaborated with NVIDIA to optimize the models for NVIDIA hardware via TensorRT-LLM. Distribution is governed by Google's own Gemma terms of use: they permit commercial use and modification, which distinguishes Gemma from some other open projects, but include a prohibited use policy that restricts certain applications (e.g., weapons development) and requires derivative products to adhere to the same responsible-use conditions[3].
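As a practical detail for applications, the instruction-tuned Gemma checkpoints expect a turn-based prompt format with <start_of_turn> and <end_of_turn> control tokens and the roles "user" and "model". In practice a library's built-in chat template produces this; the helper below is a hand-rolled sketch of the format.

```python
def format_gemma_chat(messages):
    """Build a Gemma-style chat prompt.

    messages: list of (role, text) pairs, with role "user" or "model".
    Ends with an opened model turn to cue the model to respond.
    """
    parts = []
    for role, text in messages:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")   # cue the model to produce its turn
    return "".join(parts)

prompt = format_gemma_chat([("user", "Why is the sky blue?")])
```

The tokenizer additionally prepends a beginning-of-sequence token; when using an inference library, prefer its chat-template facilities over manual string assembly.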

Safety and Responsibility

The developers paid significant attention to safety issues, given the open nature of the models.

  • Data Filtering: During the preparation of training datasets, personal data and other sensitive information were automatically filtered to reduce the risk of leaks[2].
  • Alignment: The instruction-tuned versions of the models underwent multi-stage alignment using Supervised Fine-Tuning (SFT) and RLHF (Reinforcement Learning from Human Feedback) techniques to instill preferred response styles[1].
  • Red Teaming: Before release, the models were subjected to in-depth testing for resilience against malicious prompts. Experts attempted to provoke the generation of harmful or undesirable content to identify vulnerabilities[3].
  • Responsible AI Toolkit: Along with the models, Google released the Responsible Generative AI Toolkit to facilitate safe deployment, including tooling for debugging and analyzing model behavior and classifiers for detecting undesirable content[2].
  • ShieldGemma: A family of safety classifier models built on Gemma that screen prompts and model outputs for policy-violating content; ShieldGemma 2, based on Gemma 3, extends this filtering to images[6].
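The SFT stage mentioned above optimizes an ordinary next-token cross-entropy, typically masked so that only response tokens (not the prompt) contribute to the loss. A minimal NumPy sketch, with an invented toy vocabulary and mask:

```python
import numpy as np

def sft_loss(logits, targets, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is 1.

    logits: (seq, vocab) scores; targets: (seq,) gold token ids;
    loss_mask: (seq,) with 0 on prompt tokens and 1 on response tokens.
    """
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * loss_mask).sum() / loss_mask.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))         # 6 tokens, 10-word toy vocabulary
targets = rng.integers(0, 10, size=6)
loss_mask = np.array([0, 0, 0, 1, 1, 1])  # first 3 tokens are the prompt
loss = sft_loss(logits, targets, loss_mask)
```

Masking the prompt keeps the model from being penalized for "predicting" text the user wrote; the RLHF stage that follows optimizes a learned reward instead of this fixed target.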

Literature

  • Mesnard, T. et al. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295.
  • Rivière, M. et al. (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118.
  • Kamath, A. et al. (2025). Gemma 3 Technical Report. arXiv:2503.19786.
  • Zhao, H. et al. (2024). CodeGemma: Open Code Models Based on Gemma. arXiv:2406.11409.
  • Beyer, L. et al. (2024). PaliGemma: A Versatile 3B VLM for Transfer. arXiv:2407.07726.
  • Steiner, A. et al. (2024). PaliGemma 2: A Family of Versatile VLMs for Transfer. arXiv:2412.03555.
  • Botev, A. et al. (2024). RecurrentGemma: Moving Past Transformers for Efficient Open Language Models. arXiv:2404.07839.
  • Ainslie, J. et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
  • Chinnakonduru, S. S. & Mohapatra, A. (2024). Weighted Grouped Query Attention in Transformers. arXiv:2407.10855.
  • Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
  • Radhakrishnan, P. et al. (2024). Knowing When to Ask — Bridging Large Language Models and Data. arXiv:2409.13741.

Notes

  1. "What Is Google Gemma?". IBM.
  2. "Gemma: Google introduces new state-of-the-art open models". Google Developers Blog.
  3. "Google's open-source Gemma AI models draw from the research behind Gemini". The Verge.
  4. "Google launches two new open LLMs". TechCrunch.
  5. "Gemma 2: Improving Open Language Models at a Practical Size". Google.
  6. "Google unveils open source Gemma 3 model with 128k context window". VentureBeat.
  7. "Gemma explained: An overview of Gemma model family architectures". Google Developers Blog.
  8. "Google Releases MedGemma: Open AI Models for Medical Text and Image Analysis". InfoQ.
  9. "Google Is Training a New A.I. Model to Decode Dolphin Chatter—and Potentially Talk Back". Smithsonian Magazine.