DBRX (language model)

From Systems Analysis Wiki

DBRX is an open-source large language model (LLM) developed by the Mosaic AI research team at Databricks. The model was officially released on March 27, 2024, and is positioned as a high-performance solution for enterprise use[1].

DBRX is built on a fine-grained Mixture-of-Experts (MoE) architecture, combining high performance with training and inference efficiency. At the time of its release, DBRX demonstrated state-of-the-art results among open-source models on key benchmarks, outperforming models like LLaMA 2, Mixtral, and Grok-1, and showing competitiveness with closed-source models such as GPT-3.5 Turbo[2].

Development History

The release of DBRX continued Databricks' strategy of developing open generative models. In June 2023, Databricks acquired the startup MosaicML, which specialized in training large models, and formed its Mosaic AI division around the acquired team[3].

The Mosaic AI team, led by lead neural network architect Jonathan Frankle, began developing a new LLM with the goal of matching the quality of the best proprietary systems in an open-source format. The project was named DBRX. Development and pre-training took approximately 2.5 months and cost an estimated $10 million[3].

Architecture

DBRX is a decoder-only transformer model and implements a fine-grained Mixture-of-Experts (MoE) architecture.

Key architectural features:

  • Total parameters: 132 billion.
  • Experts: The model consists of 16 smaller, specialized sub-models ("experts").
  • Activation mechanism: For each input token, only 4 of the 16 experts are activated, so only 36 billion of the 132 billion parameters are used per token, ensuring high inference speed and efficiency. This design offers 65 times more possible expert combinations than Mixtral (8 experts, 2 active per token)[1].
  • Components: It utilizes modern architectural solutions such as Rotary Position Embeddings (RoPE), Gated Linear Units (GLU), and Grouped-Query Attention (GQA).
  • Context length: 32,768 tokens.

This architecture allows the model to combine the advantages of a vast number of parameters (for knowledge storage) with the efficiency of smaller models (for inference speed).
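The "65 times more combinations" figure follows directly from counting the expert subsets a router can choose per token; a quick sanity check in Python:

```python
from math import comb

# Number of distinct expert subsets the router can select per token.
dbrx_combos = comb(16, 4)    # DBRX: 4 of 16 experts
mixtral_combos = comb(8, 2)  # Mixtral: 2 of 8 experts

print(dbrx_combos, mixtral_combos, dbrx_combos // mixtral_combos)
# 1820 vs. 28 possible subsets, i.e. a 65x larger routing space
```

A larger routing space lets the fine-grained MoE specialize experts more narrowly without increasing the per-token compute.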

Training

DBRX was pre-trained on a carefully curated dataset of 12 trillion tokens, consisting of text and code. Data quality was a key priority: the developers used the Databricks cloud platform (Apache Spark, Databricks Notebooks, Unity Catalog) for data cleaning, preparation, and auditing[1].

The training process employed curriculum learning, where the proportion of data types was varied at different stages. For example, the final part of the training involved the gradual introduction of complex tasks, which, according to the developers, resulted in a significant quality improvement. The training was conducted on a cluster of 3,072 Nvidia H100 GPUs.

After pre-training, the base model underwent additional fine-tuning (instruction tuning) to create the interactive version, DBRX Instruct, which is optimized for following user instructions.
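Databricks has not published its exact data-mix schedule, but the curriculum-learning idea described above can be sketched as a sampling mix that shifts as training progresses (all names and proportions below are illustrative assumptions, not DBRX's actual recipe):

```python
# Illustrative only: DBRX's real curriculum schedule is not public.
# A curriculum here is just a data-sampling mix that changes with
# training progress, up-weighting harder data toward the end.

def data_mix(progress: float) -> dict[str, float]:
    """Return sampling weights per data type at a point in training,
    where progress runs from 0.0 (start) to 1.0 (end)."""
    hard = 0.1 + 0.5 * progress  # e.g. complex code/reasoning tasks
    easy = 1.0 - hard            # e.g. general web text
    return {"general_text": easy, "complex_tasks": hard}

print(data_mix(0.0))  # {'general_text': 0.9, 'complex_tasks': 0.1}
print(data_mix(1.0))  # {'general_text': 0.4, 'complex_tasks': 0.6}
```

The gradual introduction of complex tasks late in training corresponds to `progress` approaching 1.0, where the mix tilts toward the harder data.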

Performance

At the time of its release, DBRX set a new standard for quality among open-source LLMs across a wide range of benchmarks.

Comparison with Open-Source Models

DBRX Instruct results on key benchmarks[1]:

  • Hugging Face Open LLM Leaderboard (average; general knowledge): 74.5% vs. 72.7% (Mixtral Instruct, next best)
  • HumanEval (programming): 70.1% vs. 63.2% (Grok-1, next best)
  • GSM8K (mathematical reasoning): 66.9% vs. 62.9% (Grok-1, next best)
  • MMLU (general knowledge): 73.7% vs. 71.5% (Mixtral Instruct, next best)

DBRX took first place in both the overall Hugging Face Open LLM Leaderboard and the comprehensive Databricks LLM Gauntlet test, demonstrating a significant lead over its predecessors[1].

Comparison with Closed-Source Models

DBRX Instruct surpasses GPT-3.5 Turbo on several key metrics, including MMLU (73.7% vs. 70.0%) and HumanEval (70.1% vs. 48.1%). In terms of response quality on some benchmarks (e.g., MTBench), the model approaches the level of Gemini 1.0 Pro and early versions of GPT-4[1].

Training and Inference Efficiency

  • Training efficiency: The MoE architecture reduced the training compute (FLOPs) required by a factor of 2-4 compared to dense models of similar quality.
  • Inference efficiency: By activating only 36 billion parameters per token, DBRX delivers 2-3 times higher throughput (tokens generated per second) than dense models with a comparable total parameter count (e.g., LLaMA2-70B)[1].
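The efficiency gain comes from top-k routing: a router scores all experts for each token, and only the highest-scoring k run. The sketch below is a minimal illustration of that mechanism, not Databricks' actual router implementation. Note that only the expert feed-forward blocks are gated; attention and embeddings always run, which is why 36B of 132B parameters (about 27%) are active rather than exactly 4/16 (25%).

```python
import random

# Minimal top-k expert routing sketch (illustrative, not DBRX's code).
NUM_EXPERTS, TOP_K = 16, 4

def route(router_logits: list[float]) -> list[int]:
    """Return indices of the TOP_K highest-scoring experts for one token."""
    ranked = sorted(range(NUM_EXPERTS),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
print("experts run for this token:", sorted(route(logits)))
print(f"fraction of expert parameters touched: {TOP_K / NUM_EXPERTS:.2f}")
```

Because the other 12 experts are skipped entirely for that token, their parameters contribute storage capacity (knowledge) without adding inference cost.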

Licensing and Availability

DBRX is distributed under the custom Databricks Open Model License. It permits free use and modification, including for commercial purposes, but contains several restrictions. Specifically, like the LLaMA 2 license, it requires obtaining separate permission from Databricks if services based on DBRX serve more than 700 million monthly active users.

The pre-trained model weights (for both the base and Instruct versions) are available for download from the Hugging Face repository[4].

Further Reading

  • Mosaic Research Team. (2024). Introducing DBRX: A New State‑of‑the‑Art Open LLM. Databricks Blog.
  • Databricks. (2024). Databricks Open Model License (DBRX). Online specification.
  • Fedus, W.; et al. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
  • Lepikhin, D.; et al. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668.
  • Ainslie, J.; et al. (2023). Grouped‑Query Attention: Efficient Training of Generalized Multi‑Query Transformers. arXiv:2305.13245.
  • Su, J.; et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
  • Dao, T. (2023). FlashAttention‑2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691.
  • Cai, W.; et al. (2024). A Survey on Mixture of Experts in Large Language Models. arXiv:2407.06204.
  • Chen, Y.; et al. (2024). Scaling Laws for Fine‑Grained Mixture of Experts. arXiv:2402.07871.
  • Kundu, A.; et al. (2024). Strategic Data Ordering: Enhancing Large Language Model Training via Curriculum Learning. arXiv:2405.07490.

References

  1. "Introducing DBRX: A New State-of-the-Art Open LLM". Databricks Blog.
  2. "Databricks' open-source DBRX LLM beats Llama 2, Mixtral, and Grok". InfoWorld.
  3. "Databricks spent $10M on new DBRX generative AI model". TechCrunch.
  4. "databricks/dbrx-base". Hugging Face.