Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method for adapting large language models (LLMs) to new tasks at low computational cost. The technique was introduced in a 2021 paper by Edward Hu and colleagues[1].

Full fine-tuning of large models, such as LLaMA or GPT, requires enormous resources, making it inaccessible for most researchers and developers. LoRA addresses this problem by training only a small number of additional parameters while the pre-trained weights stay fixed, and it maintains quality comparable to full fine-tuning on many tasks[2].

How It Works

The core idea of LoRA is not to modify the original weights of the pre-trained model but to add a small "corrective" matrix to them. Instead of directly training the huge weight matrix `W`, LoRA represents its update as the product of two small, low-rank matrices.

Formally, if the original weight matrix of a layer `W_0` has dimensions `d × k`, its update is represented as `ΔW = BA`, where `B` is a matrix of size `d × r` and `A` is a matrix of size `r × k`. The rank `r` is a hyperparameter chosen to be much smaller than both dimensions (`r ≪ min(d, k)`). During fine-tuning, the original weights `W_0` are frozen, and only the matrices `A` and `B` are trained. The final weight matrix is calculated as `W = W_0 + BA`.
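
The decomposition above can be illustrated with a minimal NumPy sketch. The class name `LoRALinear`, the `alpha / r` scaling factor, and the initialization (small random `A`, zero `B`, so that `BA = 0` before training) are assumptions made for illustration and are not taken from this article; the "training" step is replaced by a random stand-in.

```python
import numpy as np


class LoRALinear:
    """Illustrative layer computing W_0 x plus a low-rank update B A x."""

    def __init__(self, w0, r, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d, k = w0.shape
        self.w0 = w0                                  # frozen weights, shape (d, k)
        self.a = rng.normal(0.0, 0.01, size=(r, k))   # trainable, shape (r, k)
        self.b = np.zeros((d, r))                     # trainable, shape (d, r)
        self.scale = alpha / r                        # assumed scaling convention

    def delta_w(self):
        # Explicit d x k update; only needed for inspection or merging.
        return self.scale * (self.b @ self.a)

    def forward(self, x):
        # W_0 x + B (A x): the d x k update is never materialized here.
        return self.w0 @ x + self.scale * (self.b @ (self.a @ x))


rng = np.random.default_rng(1)
w0 = rng.normal(size=(64, 32))
x = rng.normal(size=32)

layer = LoRALinear(w0, r=4, alpha=4.0)
y_init = layer.forward(x)          # B is zero, so this equals w0 @ x

layer.b = rng.normal(size=(64, 4))  # stand-in for a trained B
y_trained = layer.forward(x)        # equals (w0 + delta_w) @ x
```

Computing `B (A x)` rather than `(B A) x` keeps the extra cost proportional to `r (d + k)` instead of `d k`.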

This reduces the number of trainable parameters by several orders of magnitude. For example, when fine-tuning GPT-3 (175 billion parameters), LoRA reduces the number of trainable parameters by a factor of 10,000 and GPU memory requirements by a factor of 3[1].
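
A back-of-the-envelope calculation shows where a reduction of this size can come from. The numbers below are assumptions, not figures from this article: the publicly reported GPT-3 175B shape (96 layers, hidden size 12288) and one possible setup in which only the query and value projections of each layer are adapted with rank 4.

```python
def lora_params(d, k, r):
    """Trainable parameters for one adapted d x k matrix (B: d x r, A: r x k)."""
    return r * (d + k)


# Assumed model shape and LoRA setup (illustrative, see the text above).
N_LAYERS = 96
D_MODEL = 12288
ADAPTED_PER_LAYER = 2          # query and value projections
RANK = 4
TOTAL_PARAMS = 175_000_000_000

full_per_matrix = D_MODEL * D_MODEL
lora_per_matrix = lora_params(D_MODEL, D_MODEL, RANK)
lora_total = N_LAYERS * ADAPTED_PER_LAYER * lora_per_matrix
reduction = TOTAL_PARAMS / lora_total

print(f"parameters in one full matrix: {full_per_matrix:,}")
print(f"LoRA parameters per matrix:    {lora_per_matrix:,}")
print(f"LoRA parameters in total:      {lora_total:,}")
print(f"reduction vs. all parameters:  {reduction:,.0f}x")
```

Under these assumptions the adapters hold about 18.9 million parameters, a reduction of roughly 9,300 times, which is the same order of magnitude as the factor of 10,000 cited above.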

Key Advantages

  • Resource Efficiency: The number of trainable parameters is reduced by orders of magnitude (by a factor of 10,000 in the GPT-3 example[1]), which significantly decreases GPU memory (VRAM) consumption and can speed up training.
  • No Inference Latency: After training, the matrices `B` and `A` can be "merged" with the main matrix `W_0` by calculating `W = W_0 + BA`. Thus, no additional computations or latency are introduced during model inference[1].
  • Modularity and Fast Task-Switching: Trained LoRA adapters are small files (typically from a few megabytes to a few hundred megabytes, depending on the rank and the size of the base model). This makes it easy to store dozens of adapters for different tasks and quickly switch between them without changing the base model[3].
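
Merging and task-switching can be sketched with NumPy as follows. The adapter names `task_a` and `task_b`, the helper functions `merge` and `unmerge`, and the random matrices standing in for trained adapters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 2
w0 = rng.normal(size=(d, k))    # shared, frozen base weights

# Hypothetical adapters: each is just a (B, A) pair for the same layer.
adapters = {
    "task_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "task_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}


def merge(w, adapter):
    """Fold an adapter into a weight matrix: W = W + B A."""
    b, a = adapter
    return w + b @ a


def unmerge(w, adapter):
    """Remove a previously merged adapter: W = W - B A."""
    b, a = adapter
    return w - b @ a


x = rng.normal(size=k)

# Unmerged inference: base path plus a separate low-rank path.
b_a, a_a = adapters["task_a"]
unmerged_out = w0 @ x + b_a @ (a_a @ x)

# Merged inference: a single matrix product, same result.
w_a = merge(w0, adapters["task_a"])
merged_out = w_a @ x

# Switching tasks: subtract one adapter, add another.
w_b = merge(unmerge(w_a, adapters["task_a"]), adapters["task_b"])

adapter_size = r * (d + k)      # values stored per adapter
base_size = d * k               # values stored in the base matrix
```

After merging, inference uses a matrix of the same shape as `W_0`, which is why no extra latency is introduced; each adapter stores only `r (d + k)` values per adapted matrix.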

Limitations and Modifications

Although LoRA is very effective, its low-rank nature can be a limitation for tasks that require memorizing a large amount of new information. To address this and other issues, various modifications have been proposed.

QLoRA

QLoRA (Quantized Low-Rank Adaptation), proposed in 2023, is one of the most popular modifications. It combines LoRA with 4-bit quantization of the frozen base model, while the LoRA adapters themselves are trained in higher precision[4]. This further reduces memory requirements, making it possible to fine-tune a model with 65 billion parameters on a single 48 GB GPU. The Guanaco model family, for instance, was trained with QLoRA and reached 99.3% of ChatGPT's performance level on the Vicuna benchmark[4].
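
A rough estimate shows why 4-bit storage of the base model matters. The sketch below counts only the memory for the weights themselves; quantization metadata, activations, adapter gradients, and optimizer state are deliberately ignored, so real requirements are higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 65_000_000_000   # a 65B-parameter model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

At 16 bits per parameter the weights alone occupy 130 GB, more than any single GPU offers, while at 4 bits they occupy 32.5 GB, which leaves room on a single large GPU for the adapters and training state.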

Other Modifications

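  • MoRA (High-Rank Updating): Proposed for tasks where LoRA shows insufficient performance due to its rank limitation. MoRA replaces the two low-rank matrices with a single square matrix, combined with non-trainable operators that reduce the input dimension and restore the output dimension, which allows high-rank weight updates with the same number of trainable parameters[5].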

Implementation and Application

The LoRA technique has gained widespread adoption due to its efficiency and ease of integration. The PEFT (Parameter-Efficient Fine-Tuning) library from Hugging Face played a key role in its popularization. PEFT provides a unified interface for applying LoRA and other PEFT methods to models from the Transformers ecosystem[6].
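
A typical way to apply LoRA through PEFT might look like the sketch below. The function name `build_lora_model` and all hyperparameter values are illustrative assumptions; in particular, `target_modules=["q_proj", "v_proj"]` only matches models whose attention projections carry those names, so it may need to be changed for other architectures.

```python
def build_lora_model(base_model_name, rank=8):
    """Load a causal LM and wrap it with LoRA adapters via Hugging Face PEFT."""
    # Imported lazily because transformers and peft are heavy optional dependencies.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=rank,                                # rank of the update matrices
        lora_alpha=2 * rank,                   # scaling factor (assumed value)
        lora_dropout=0.05,                     # dropout on the LoRA path (assumed value)
        target_modules=["q_proj", "v_proj"],   # layer names are model-specific
    )

    model = get_peft_model(base_model, config)  # base weights are frozen
    model.print_trainable_parameters()          # reports trainable vs. total counts
    return model
```

Calling, for example, `build_lora_model("facebook/opt-125m")` downloads that checkpoint and returns a model in which only the adapter weights are trainable; the result can then be passed to an ordinary training loop, and `model.save_pretrained(...)` stores the adapter weights rather than the full model.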

LoRA is actively used for:

  • Adapting chatbots and conversational systems (e.g., fine-tuning LLaMA, Mistral).
  • Creating models for text classification and generation in specialized domains.
  • Personalizing models for a specific style or data format.

Literature

  • Hu, E.J. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
  • Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314.
  • Zhang, Q. et al. (2023). AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. arXiv:2303.10512.
  • Chen, Y. et al. (2023). LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models. arXiv:2309.12307.
  • Mao, Y. et al. (2024). A Survey on LoRA of Large Language Models. arXiv:2407.11046.
  • Jiang, T. et al. (2024). MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning. arXiv:2405.12130.
  • Liu, Z. et al. (2024). ALoRA: Allocating Low-Rank Adaptation for Fine-Tuning Large Language Models. arXiv:2403.16187.
  • Liu, J. et al. (2025). RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation. arXiv:2501.04315.
  • Albert, P. et al. (2025). RandLoRA: Full-Rank Parameter-Efficient Fine-Tuning of Large Models. arXiv:2502.00987.
  • Tastan, N. et al. (2025). LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning. arXiv:2505.21289.

References

  1. Hu, E.J., et al. "LoRA: Low-Rank Adaptation of Large Language Models". arXiv:2106.09685.
  2. Mao, Y., et al. "A Survey on LoRA of Large Language Models". arXiv:2407.11046.
  3. Noble, Joshua. "What is LoRA (Low-Rank Adaption)?". IBM Technology.
  4. Dettmers, T., et al. "QLoRA: Efficient Finetuning of Quantized LLMs". arXiv:2305.14314.
  5. Jiang, T., et al. "MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning". arXiv:2405.12130.
  6. "LoRA (Low-Rank Adaptation)". Hugging Face LLM Course.