Jais (language model)
Jais (pronounced "Jice") is a family of open-source large language models (LLMs) developed in the United Arab Emirates and specifically optimized for the Arabic language[1]. The model is named after Jebel Jais, the highest peak in the UAE[2].
The project is a collaboration between the research company Inception (a subsidiary of the technology conglomerate G42), Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), and the California-based AI chip manufacturer Cerebras Systems[2]. Jais was released under an open-source license to foster the development of an AI ecosystem for the Arabic language, preserving cultural and linguistic heritage and making modern AI technologies more accessible to the Arabic-speaking world[1].
Development History and Releases
The Jais project was initiated in 2023 to address the limitations of existing LLMs for low-resource languages. The developers noted a lack of high-quality bilingual models capable of processing both Arabic and English with equal proficiency[2].
Jais-13B: The First Version
The first version, Jais-13B, was released on August 30, 2023, and contained 13 billion parameters[1]. The model was trained on a mixed corpus of English and Arabic texts totaling 395 billion tokens[3]. At the time of its release, it was described as "the highest-quality Arabic LLM"[1].
Jais-30B: Scaling Up
On November 8, 2023, less than three months later, the consortium introduced a second, significantly improved version, Jais-30B, with 30 billion parameters[4]. The increase in scale was driven by the need to handle more complex practical tasks, such as summarization and translation. The model was trained on an expanded and cleaned dataset of 1.63 trillion tokens[4].
Jais-70B and the Model Family
On August 6, 2024, Inception (G42) announced the launch of the flagship model Jais-70B (70 billion parameters) together with an entire family of related models[5]. Jais-70B became the largest open-source LLM focused on the Arabic language. Its development used continued pre-training: rather than training from scratch, the model was initialized from Meta's Llama 2 70B and then further trained on 330 billion tokens of Arabic text. This approach transferred Llama 2's existing English-language knowledge and concentrated compute on Arabic proficiency[5].
Architecture and Technical Features
Jais is an autoregressive transformer model based on the GPT-3 architecture (decoder-only). The model's key feature is its bilingual specialization in Arabic and English, unlike many multilingual LLMs where English is predominant. This enables it to achieve a deep understanding of the Arabic language and its dialects[3].
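As a decoder-only model, Jais generates text autoregressively: each new token is predicted from all previously generated ones and appended to the input. A minimal greedy-decoding loop illustrates the idea (`logits_fn` here is a hypothetical stand-in for the model's forward pass, not part of Jais's actual API):

```python
def generate_greedy(logits_fn, prompt_ids, max_new_tokens, eos_id=None):
    # Autoregressive decoding: repeatedly feed the growing sequence back in
    # and append the most likely next token. Greedy selection is shown for
    # simplicity; real decoders usually sample with temperature/top-p.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)  # scores over the vocabulary
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids
```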
Advanced technical solutions were integrated into Jais's design[3]:
- ALiBi (Attention with Linear Biases): instead of conventional positional embeddings, a head-specific linear penalty is added to attention scores, letting the model extrapolate to contexts longer than those seen during training.
- SwiGLU activation: a gated feed-forward activation (a GLU variant built on the Swish function) that improves training quality and the expressiveness of the network's layers.
- Maximal Update Parametrization (µP): a parametrization under which hyperparameters tuned on small models transfer to larger ones, stabilizing training as model size increases.
- Specialized tokenizer: developed for the specifics of Arabic and English; it encodes Arabic text in 3-4 times fewer tokens than general-purpose tokenizers, increasing processing speed[6].
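The first two techniques above can be sketched in a few lines. The following is an illustrative pure-Python sketch of ALiBi's head slopes and bias matrix and of a SwiGLU unit, not Jais's actual implementation (function names and weight shapes are assumptions for illustration):

```python
import math

def alibi_slopes(n_heads):
    # Head-specific slopes from the ALiBi paper: a geometric sequence
    # 2^(-8/n), 2^(-16/n), ... (assuming n_heads is a power of two).
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

def alibi_bias(slope, seq_len):
    # Lower-triangular linear bias added to attention scores: a query at
    # position i attending to key j (j <= i) is penalized by slope * (i - j);
    # future positions are masked with -inf (causal attention).
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)] for i in range(seq_len)]

def swish(z):
    return z / (1.0 + math.exp(-z))

def swiglu(x, W, V):
    # Gated feed-forward unit: Swish(x @ W) elementwise-times (x @ V).
    xw = [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]
    xv = [sum(xi * vij for xi, vij in zip(x, col)) for col in zip(*V)]
    return [swish(a) * b for a, b in zip(xw, xv)]
```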
In addition to the base (foundation) models, a Jais-chat version was released, which was further fine-tuned on 9.6 million question-answer pairs to adapt it for chatbot and assistant tasks[3].
Training and Dataset
One of the project's main challenges was preparing a high-quality, large-scale corpus of Arabic texts. The final training dataset for Jais-13B consisted of 395 billion tokens, of which:
- 116 billion tokens (29%) were Arabic text.
- 279 billion tokens (71%) were English text and programming code.
The Arabic component was intentionally made significant (around 30%) to ensure high proficiency in the language[3]. The data included books, news articles, web pages, and source code. To increase the volume of high-quality Arabic texts, machine translation of English-language resources was used[3].
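The reported mix is easy to verify: the two parts sum to the stated total, and the Arabic share rounds to the quoted figure of about 29%:

```python
# Jais-13B pre-training mix as reported (billions of tokens).
arabic_b = 116
english_and_code_b = 279
total_b = arabic_b + english_and_code_b

arabic_share = arabic_b / total_b
print(f"{total_b}B tokens total; Arabic: {arabic_share:.1%}")
# → 395B tokens total; Arabic: 29.4%
```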
The models were trained on the Condor Galaxy 1 (CG-1) supercomputer in Abu Dhabi, jointly built by G42 and Cerebras Systems. Thanks to this infrastructure, training Jais-13B took only about 3.5 days of compute time[2].
Application and Significance
Jais is positioned as a key step in the development of generative AI for Arabic and for other language communities underrepresented in modern LLMs. Open access to the model is intended to stimulate the adoption of natural language processing technologies across the Middle East and North Africa (MENA).
Since its launch, the project has attracted interest from government and commercial entities in the UAE. Early access to the model was granted to the UAE Ministry of Foreign Affairs, the oil and gas company ADNOC, Etihad Airways, and the First Abu Dhabi Bank[1]. In 2024, Microsoft announced the integration of Jais into its Microsoft Azure cloud platform, making it available to global users[6].
The creators of Jais emphasize its role in preserving Arabic cultural and linguistic heritage. According to Inception CEO Andrew Jackson, the project aims to "ensure that the Arabic language, with its rich heritage, finds its voice in the AI landscape"[1]. The experience gained from this project is planned to be used to create similar LLMs for other languages and cultures[1].
Literature
- Shazeer, N.; et al. (2020). GLU Variants Improve Transformer. arXiv:2002.05202.
- Press, O.; et al. (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv:2108.12409.
- Yang, G.; et al. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero‑Shot Hyperparameter Transfer. arXiv:2203.03466.
- Ali, A. R.; et al. (2022). A Large and Diverse Arabic Corpus for Language Modeling. arXiv:2201.09227.
- Sengupta, N.; et al. (2023). Jais and Jais‑chat: Arabic‑Centric Foundation and Instruction‑Tuned Open Generative Large Language Models. arXiv:2308.16149.
- Inception AI (2024). JAIS 30B Whitepaper. Online whitepaper.
- Koto, F.; et al. (2024). ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic. arXiv:2402.12840.
- Qian, Z.; et al. (2024). CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks. arXiv:2409.12623.
- Blake, C.; et al. (2024). u‑μP: The Unit‑Scaled Maximal Update Parametrization. arXiv:2407.17465.
- Inception AI; MBZUAI; Cerebras Systems (2024). Jais Family Model Card. Hugging Face.
Notes
- [1] "Meet 'Jais', The World's Most Advanced Arabic Large Language Model Open Sourced by G42's Inception". Cerebras Systems.
- [2] "UAE's G42 launches open source Arabic language AI model". Reuters.
- [3] "Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models". arXiv:2308.16149.
- [4] "Upgraded Arabic large language model is twice as big". Computer Weekly.
- [5] "G42 launches JAIS 70B and 20 other AI models to advance Arabic natural language processing". Abu Dhabi Media Office.
- [6] "Introducing JAIS: Arabic-centric Large Language Model on Azure". Microsoft Tech Community.