Mixtral (Mistral AI)
Mixtral 8x7B is an open-source large language model (LLM) developed by the French company Mistral AI and released in December 2023. The model is based on the Sparse Mixture of Experts (SMoE) architecture, which allows it to achieve performance comparable to much larger models (such as Llama 2 70B and GPT-3.5) while maintaining high speed and efficiency during inference[1].
The model is distributed under the open Apache 2.0 license, making it available for academic and commercial use. Mixtral 8x7B demonstrates strong capabilities in multilingual tasks, code generation, and instruction following, which made it one of the most popular open-source models at the time of its release[2].
Development History
Mistral AI was founded in April 2023 by former researchers from Meta and Google. In September 2023, the company released its first model, Mistral 7B, which gained recognition for its high efficiency despite its small size.
On December 11, 2023, Mistral AI announced the release of Mixtral 8x7B, its first model based on the Mixture of Experts architecture. The model immediately captured the community's attention as the most powerful open-source LLM at the time, demonstrating quality on par with GPT-3.5 but with significantly faster inference speeds. In January 2024, a detailed technical description of the model was published as a scientific paper on arXiv, allowing independent researchers to review the architectural details and test results[2].
Architecture: Sparse Mixture of Experts (SMoE)
The main innovation in Mixtral 8x7B is its Sparse Mixture of Experts architecture. Unlike standard ("dense") transformers, where every token passes through the same feed-forward network in each layer, each Mixtral layer contains several parallel "expert" feed-forward blocks, and only a subset of them processes any given token.
Key architectural features:
- MoE Structure: Each transformer layer contains 8 feed-forward blocks ("experts"). For processing each token, a special router network selects the 2 most suitable experts (Top-2 routing).
- Parameters: The total number of parameters in the model is 46.7 billion. However, due to sparse activation, only 12.9 billion active parameters are used for each token during inference. This results in an inference speed comparable to models with ~13 billion parameters.
- Attention Optimization: The model uses modern attention techniques, including Sliding Window Attention (SWA) for efficiently handling long sequences and Grouped Query Attention (GQA) for faster inference and reduced memory usage.
- Context Length: The model supports a context window of up to 32,768 tokens.
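The Top-2 routing described above can be sketched in a few lines. This is a minimal NumPy illustration, not Mistral's implementation: the function name `top2_moe_layer` and its signature are hypothetical, and real implementations batch the expert computations rather than looping over tokens.

```python
import numpy as np

def top2_moe_layer(x, gate_w, experts):
    """Sparse MoE feed-forward: route each token to its top-2 experts.

    x       : (tokens, d_model) token activations
    gate_w  : (d_model, n_experts) router weights
    experts : list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ gate_w                      # (tokens, n_experts) router scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top2 = np.argsort(logits[t])[-2:]    # indices of the 2 best-scoring experts
        w = np.exp(logits[t][top2])
        w /= w.sum()                         # softmax over the selected 2 only
        for weight, e in zip(w, top2):       # weighted sum of the 2 expert outputs
            out[t] += weight * experts[e](x[t])
    return out
```

Because only 2 of the 8 expert blocks run per token, the compute per token is that of a much smaller dense model, even though all expert weights must be kept in memory.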
Training
The Mixtral 8x7B family includes two main versions:
1. Mixtral-8x7B-v0.1 (base model): Pre-trained on a large corpus of web data in several European languages (English, French, German, Spanish, Italian); its training objective is next-token prediction.
2. Mixtral-8x7B-Instruct-v0.1 (instruct model): A version fine-tuned with supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO). This model is better at following user instructions and is designed for conversational use.
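The DPO objective used for the Instruct model can be illustrated for a single preference pair. This is a generic sketch of the published DPO loss, not Mistral's training code; the function name and the default `beta` value are illustrative.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are log-probabilities of the chosen/rejected responses under
    the policy being trained (pi_*) and the frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen response over the rejected one, relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: near zero when the policy already
    # prefers the chosen response, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; increasing the policy's preference for the chosen response drives the loss toward zero.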
Performance
Mixtral 8x7B outperforms or is on par with Llama 2 70B on most standard benchmarks, while having 5 times fewer active parameters and, consequently, a significantly higher inference speed (up to 6 times faster)[2].
| Metric | Llama 2 70B | GPT-3.5 | Mixtral 8x7B |
|---|---|---|---|
| MMLU (general knowledge) | 69.9% | 70.0% | 70.6% |
| GSM-8K (mathematics) | 53.6% | 57.1% | 58.4% |
| MBPP (code generation) | 49.8% | 52.2% | 60.7% |
| MT-Bench (dialogue evaluation, Instruct versions) | 6.86 | 8.32 | 8.30 |
- Multilinguality: Due to an increased proportion of multilingual data in its training corpus, Mixtral significantly outperforms Llama 2 70B in tasks involving French, German, Spanish, and Italian.
- Bias and Hallucinations: Compared to Llama 2 70B, the model demonstrates higher accuracy on the BBQ benchmark (evaluating social biases) and a more positive sentiment profile on the BOLD benchmark.
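The total and active parameter counts quoted above can be reproduced from the dimensions published in the technical report (model dimension 4096, 32 layers, SwiGLU experts with hidden size 14336, 32 query / 8 key-value heads of dimension 128, 32,000-token vocabulary). This back-of-the-envelope sketch treats norms and biases as negligible.

```python
# Architecture dimensions as reported in the Mixtral technical report.
dim, n_layers, hidden, vocab = 4096, 32, 14336, 32_000
n_heads, n_kv_heads, head_dim = 32, 8, 128
n_experts, top_k = 8, 2

def params(active_experts):
    # Each SwiGLU expert has three projection matrices (up, gate, down).
    expert = 3 * dim * hidden
    # Attention: full query/output projections, smaller grouped KV projections.
    attn = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)
    gate = dim * n_experts                  # router weights per layer
    layer = active_experts * expert + attn + gate
    embed = 2 * vocab * dim                 # input embedding + output head
    return n_layers * layer + embed

total = params(n_experts)   # every weight stored: ~46.7 billion
active = params(top_k)      # weights touched per token: ~12.9 billion
```

The gap between the two numbers is almost entirely the six inactive experts per layer, which is why inference cost tracks a ~13B dense model while memory requirements track the full 46.7B.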
Licensing and Availability
Both versions of Mixtral 8x7B (base and Instruct) are released under the Apache 2.0 license, which permits free academic and commercial use. The source code and model weights are available on GitHub and Hugging Face.
Links
- Mixtral of Experts — Official announcement on the Mistral AI blog
- Mixtral 8x7B model on Hugging Face
Literature
- Jiang, A. Q.; Sablayrolles, A.; Roux, A.; et al. (2024). Mixtral of Experts. arXiv:2401.04088.
- Shazeer, N.; et al. (2017). Outrageously Large Neural Networks: The Sparsely‑Gated Mixture‑of‑Experts Layer. arXiv:1701.06538.
- Lepikhin, D.; et al. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668.
- Fedus, W.; et al. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
- Ainslie, J.; et al. (2023). GQA: Training Generalized Multi‑Query Transformer Models from Multi‑Head Checkpoints. arXiv:2305.13245.
- Beltagy, I.; Peters, M. E.; Cohan, A. (2020). Longformer: The Long‑Document Transformer. arXiv:2004.05150.
- Dao, T.; et al. (2022). FlashAttention: Fast and Memory‑Efficient Exact Attention with IO‑Awareness. arXiv:2205.14135.
- Cai, W.; et al. (2025). A Survey on Mixture of Experts in Large Language Models. arXiv:2407.06204.
- Yun, L.; et al. (2024). Toward Inference‑Optimal Mixture‑of‑Expert Large Language Models. arXiv:2404.02852.
- Huang, B.; et al. (2024). Toward Efficient Inference for Mixture of Experts. OpenReview: stXtBqyTWX.