Jamba (language model)
Jamba is a family of large language models (LLMs) developed by the Israeli research company AI21 Labs. Jamba introduces a first-of-its-kind hybrid architecture that combines key elements from two dominant approaches in AI development: Transformers and State Space Models (SSMs), specifically the Mamba architecture[1].
Jamba's primary goal is to address a fundamental trade-off in modern LLMs: the high quality and performance (characteristic of Transformers) versus the efficiency and ability to process ultra-long contexts (characteristic of SSMs). By combining these approaches and adding sparsity through Mixture-of-Experts (MoE), Jamba offers a model that is simultaneously powerful, efficient, and capable of handling vast amounts of text in a single query.
Jamba's Architecture in Detail
Jamba does not simply alternate Transformer and Mamba layers one-for-one. It employs a carefully designed block structure in which each block consists of eight layers.
Structure of a single Jamba block:
- One Transformer Layer: This layer is responsible for "deep" understanding and complex reasoning. The Mixture-of-Experts (MoE) architecture is built into this layer.
- Seven Mamba Layers: These layers follow the Transformer layer and are responsible for efficient sequence processing and propagating information across a long context[2].
This asymmetric structure allows the model to manage computational resources efficiently: the heavy but powerful Transformer operations are performed less frequently, while the lightweight and fast Mamba operations are performed more often.
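The 1:7 layout described above can be sketched as a simple layer schedule. The layer names here are illustrative labels for this sketch, not module names from AI21's implementation:

```python
def jamba_block_layers(n_blocks: int) -> list:
    """Build the layer schedule implied by the text: each 8-layer block
    contains one Transformer layer (with MoE) followed by seven Mamba layers."""
    block = ["transformer_moe"] + ["mamba"] * 7
    return block * n_blocks

# For example, a 4-block stack yields 32 layers, only 4 of which are Transformer layers.
layers = jamba_block_layers(4)
```

This makes the asymmetry concrete: in a 32-layer stack, the quadratic-cost attention computation runs in only 4 layers, while the linear-cost Mamba computation handles the remaining 28.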
Mixture-of-Experts (MoE) Integration
Jamba utilizes the MoE architecture to further enhance its efficiency.
- MoE is applied only to the feed-forward network (FFN) blocks within the Transformer layers[3]. The Mamba layers remain dense.
- The first Jamba model uses 16 experts per MoE layer.
- For each token, a router network selects the top 2 experts (Top-2 gating).
This means that although the model's total parameter count is large (52 billion), only 2 of the 16 experts are active for each token in a Transformer layer, so the per-token computation involves only about 12 billion active parameters.
Evolution of Jamba Models
Jamba-v0.1 (March 2024)
The first model introduced in this family has the following specifications:
| Specification | Value |
|---|---|
| Total Parameters | 52 billion |
| Active Parameters | ~12 billion |
| Number of Experts (MoE) | 16 (2 active) |
| Context Window | 256,000 tokens |
| License | Apache 2.0[4] |
Thanks to its hybrid architecture, Jamba-v0.1 can process a context length of 256,000 tokens, equivalent to an approximately 400-page novel, and can be deployed on a single GPU with 80 GB of memory[5].
Jamba-1.5 (2024)
In 2024, AI21 Labs introduced the updated Jamba 1.5 family of models, which includes two versions: Jamba 1.5 Mini (12B active parameters out of 52B total) and Jamba 1.5 Large (94B active parameters out of 398B total)[6]. These models demonstrate significant performance improvements:
- Up to 2.5 times faster inference on long contexts compared to competitors.
- Support for nine languages, including English, Spanish, French, and Arabic[7].
Key Advantages and Performance
- Massive Context Window: at 256,000 tokens, Jamba offered one of the largest context windows among all available models (including proprietary ones) at the time of its release. This makes it well suited to tasks requiring the analysis of large documents, such as legal contracts, scientific papers, entire codebases, or long dialogues.
- High Performance and Efficiency: In benchmarks, Jamba demonstrates performance comparable to or exceeding that of leading open models of a similar size, such as Llama and Mixtral, while achieving 3 times higher throughput on long contexts.
- Openness and Accessibility: Jamba is distributed under the permissive Apache 2.0 license, allowing for free use in commercial and research applications. The model weights are available on the Hugging Face platform.
Benchmark Results
Jamba 1.5 shows competitive results on various benchmarks:
- Jamba 1.5 Mini scored 46.1 on Arena Hard, making it the leading public model in its category[8].
- Jamba 1.5 Large scored 65.4 on Arena Hard, outperforming Llama 3.1 70B and 405B.
Applications and Availability
Jamba is optimized for business applications and supports capabilities such as function calling, structured JSON output, and document processing. The model is available on multiple platforms, including:
- Hugging Face
- Google Cloud Vertex AI
- Microsoft Azure
- NVIDIA API catalog
- Amazon Bedrock
- AI21 Studio
To support cost-effective inference, AI21 Labs introduced ExpertsInt8, a new quantization technique that allows Jamba 1.5 Large to be hosted on a machine with eight 80GB GPUs without quality loss when processing a 256K token context[9].
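ExpertsInt8 stores the MoE expert weights in INT8 and dequantizes them inside fused kernels at inference time. The following is a generic sketch of symmetric, per-row, weight-only INT8 quantization to illustrate the underlying idea; it is not AI21's actual kernel implementation:

```python
import numpy as np

def int8_quantize(w: np.ndarray):
    """Symmetric per-output-row INT8 quantization: store int8 weights
    plus one float32 scale per row (so each row uses the full [-127, 127] range)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def int8_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix from int8 values and scales."""
    return q.astype(np.float32) * scale

# Toy expert weight matrix: 8 output rows, 16 input columns.
w = np.random.default_rng(1).normal(size=(8, 16)).astype(np.float32)
q, s = int8_quantize(w)
w_hat = int8_dequantize(q, s)
```

Storing each weight in one byte instead of two (BF16) roughly halves the memory footprint of the experts, which dominate the parameter count of an MoE model; this is what allows a 398B-parameter model to fit on an eight-GPU node.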
Further Reading
- Lieber, O.; et al. (2024). Jamba: A Hybrid Transformer‑Mamba Language Model. arXiv:2403.19887.
- Lieber, O.; et al. (2024). Jamba‑1.5 Models and ExpertsInt8 Quantization. OpenReview JFPaD7lpBD.
- Gu, A.; Dao, T. (2023). Mamba: Linear‑Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
- Gu, A.; et al. (2021). S4: Efficiently Modeling Long Sequences with Structured State Spaces. arXiv:2111.00396.
- Fedus, W.; et al. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
- Yun, L.; et al. (2024). Toward Inference‑Optimal Mixture‑of‑Expert Large Language Models. arXiv:2404.02852.
- Liu, J.; et al. (2024). A Survey on Mixture of Experts in Large Language Models. arXiv:2407.06204.
- Gupta, V.; et al. (2024). Lynx: Enabling Efficient MoE Inference through Dynamic Batch‑Aware Expert Selection. arXiv:2411.08982.
- Liu, J.; et al. (2024). A Survey on Inference Optimization Techniques for Mixture of Experts Models. arXiv:2412.14219.
- Hsieh, C.‑P.; et al. (2024). RULER: What's the Real Context Size of Your Long‑Context Language Models?. arXiv:2404.06654.
References
1. "Announcing Jamba: AI21's Groundbreaking SSM-Transformer Model". AI21 Labs Blog.
2. Lieber, O., et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887.
3. "Jamba Documentation". Hugging Face Transformers.
4. "ai21labs/Jamba-v0.1". Hugging Face.
5. "AI21 Labs' Jamba: A New Hybrid LLM Architecture". Gradient Flow.
6. "Announcing the Jamba-1.5 model family". AI21 Labs Blog.
7. "ai21labs/AI21-Jamba-Large-1.5". Hugging Face.
8. "Jamba-1.5 family of models by AI21 Labs is now available in Amazon Bedrock". AWS What's New.
9. "ExpertsInt8: A new paradigm for efficient inference of MoE-based LLMs". OpenReview.