LLM cost optimization
Large Language Model (LLM) cost optimization is a set of strategies and technical methods aimed at reducing the computational and financial resources required for training, fine-tuning, and, especially, inference of large language models. The relevance of this field is driven by the enormous cost of both developing and operating LLMs.
For example, training the GPT-3 model with 175 billion parameters was estimated to cost around $4.6 million on cloud GPU infrastructure[1] and to require 1.3 million kWh of electricity[2]. However, the primary costs typically arise at the inference stage: the daily operational cost of the ChatGPT service in early 2023 was estimated at approximately $700,000 (about $0.0036 per query), so cumulative inference spending surpasses the one-time training cost within days[3].
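A back-of-envelope calculation makes the scale of these estimates concrete; all figures below are the cited estimates, not measured values:

```python
# Rough check of the inference-cost estimates cited above.
daily_cost = 700_000        # USD/day to serve ChatGPT (early-2023 estimate)
cost_per_query = 0.0036     # USD per query (same estimate)
training_cost = 4_600_000   # one-time GPT-3 training cost estimate

queries_per_day = daily_cost / cost_per_query
days_to_exceed_training = training_cost / daily_cost

print(f"~{queries_per_day:,.0f} queries/day")                      # ~194 million
print(f"inference passes training cost in ~{days_to_exceed_training:.1f} days")
```

At these rates, inference spending overtakes the entire training budget in under a week.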
Optimization During Training and Model Selection
Effective cost management begins with fundamental decisions made before the inference stage.
Scaling Laws: Model Size vs. Data Volume
One of the key breakthroughs in understanding the economics of LLM training was the Chinchilla scaling laws, introduced by DeepMind researchers in 2022. They showed that for optimal use of the computational budget, a model should be trained on a significantly larger volume of data than was previously done[4].
Historically, it was assumed that performance grew mainly by increasing the number of parameters. However, the Chinchilla study demonstrated that the Chinchilla model (70 billion parameters), trained on 1.4 trillion tokens, outperforms the much larger GPT-3 model (175 billion parameters), which was trained on only ~300 billion tokens[5]. The recommended ratio is approximately 20 tokens of training data for each model parameter. This approach allows for the creation of more compact and efficient models, reducing both training and subsequent inference costs.
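The ~20-tokens-per-parameter rule of thumb can be applied directly; the helper below is a sketch of that heuristic, not a formula from the paper:

```python
# Chinchilla rule of thumb: ~20 training tokens per model parameter.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

# Chinchilla itself: 70B parameters -> ~1.4T tokens, matching its actual setup.
print(chinchilla_optimal_tokens(70e9) / 1e12)   # 1.4 (trillion tokens)
# GPT-3 at 175B parameters was trained on ~300B tokens,
# far below the ~3.5T this heuristic suggests.
print(chinchilla_optimal_tokens(175e9) / 1e12)  # 3.5
```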
Fine-tuning and Its Efficiency
Instead of costly training from scratch, it is becoming increasingly common to fine-tune existing open-source models (e.g., the LLaMA or Falcon families). To further reduce costs, methods of Parameter-Efficient Fine-Tuning (PEFT) are applied.
The most popular such method, LoRA (Low-Rank Adaptation), adapts the model by training only a small number of additional low-rank parameters while keeping the original weights frozen. Studies show that LoRA can cut fine-tuning costs substantially (by up to ~68% in some scenarios) with a negligible impact on quality[6].
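The core idea can be sketched in a few lines of NumPy (dimensions and rank are illustrative, not taken from any particular model): the frozen weight W is augmented with a trainable low-rank product B·A, so only 2·d·r parameters are updated instead of d²:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8                           # hidden size and LoRA rank (illustrative)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized
                                         # so training starts from the base model

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + B @ A, but W itself is never updated.
    return x @ W.T + x @ A.T @ B.T

full = d * d          # parameters updated by full fine-tuning of this layer
lora = 2 * d * r      # parameters updated by LoRA
print(f"trainable params: {lora:,} vs {full:,} ({100 * lora / full:.2f}%)")
```

For this layer, LoRA trains about 1.6% of the parameters that full fine-tuning would.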
Model Compression
A crucial area of optimization is reducing the physical size of the model while preserving its performance.
Knowledge Distillation
Knowledge distillation is a process in which a large and powerful "teacher" model is used to train a more compact "student" model. The student learns to mimic the teacher's responses on a broad dataset, thereby inheriting its "knowledge." This method allows for achieving comparable quality on specific tasks at a significantly lower cost. For example, the DeepSeek-R1 model was successfully distilled from 671 billion to 70 billion and even 1.5 billion parameters with an acceptable loss of quality for many applications[7].
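A common training objective for the student is the KL divergence between temperature-softened teacher and student distributions; the sketch below follows Hinton et al.'s classic formulation with made-up logits:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the classic formulation.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

teacher = np.array([[4.0, 1.0, 0.5]])   # illustrative logits
student = np.array([[3.5, 1.5, 0.2]])
print(distillation_loss(student, teacher))  # small positive value
```

The loss is zero only when the student reproduces the teacher's distribution exactly; the temperature exposes the teacher's relative preferences among wrong answers, which is where much of the transferred "knowledge" lives.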
Quantization
Quantization is the process of reducing the numerical precision used to represent the model's weights. Instead of standard 32-bit or 16-bit floating-point numbers, 8-bit or even 4-bit integers are used.
- 8-bit quantization reduces the model size by approximately 50% with a quality degradation of about 1%.
- 4-bit quantization reduces the model size by 75% while maintaining competitive output quality[7].
With hardware support (e.g., in modern GPUs from Nvidia) and software libraries (e.g., TensorRT), quantization can speed up inference by 2–4 times[8].
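A minimal sketch of symmetric per-tensor int8 quantization (real libraries use per-channel scales, calibration, and outlier handling, but the principle is the same):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)

q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()

print(f"memory: {w.nbytes} -> {q.nbytes} bytes")   # 4x smaller than float32
print(f"max abs rounding error: {err:.4f}")        # bounded by scale / 2
```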
Optimization at the Inference Stage
Once the model is trained and deployed, the majority of costs are related to its day-to-day use.
Request Batching
Batching is the process of combining multiple user requests into a single "batch" for simultaneous processing on a GPU. This significantly increases hardware utilization and overall throughput. For LLMs, where responses are generated one token at a time, the most effective method is continuous batching (or in-flight batching). This method allows new requests to be dynamically added to the batch as other requests in it are completed, which eliminates idle time and maximizes GPU load[9].
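The benefit is easiest to see in a toy simulation (not a serving engine): each request needs `n` decode steps, the GPU runs `slots` requests concurrently, and decode lengths vary widely:

```python
import heapq

def static_batching_steps(lengths, slots):
    # A static batch occupies the GPU until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    # A freed slot immediately picks up the next waiting request.
    finish = [0] * slots                  # step at which each slot frees up
    for n in lengths:
        start = heapq.heappop(finish)     # earliest-free slot
        heapq.heappush(finish, start + n)
    return max(finish)

lengths = [10, 200, 15, 180, 12, 190, 20, 170]   # per-request decode lengths
print(static_batching_steps(lengths, 4))      # 390 steps
print(continuous_batching_steps(lengths, 4))  # 212 steps
```

Static batching pays for the longest request in every batch, while continuous batching keeps all slots busy, so the same workload finishes in far fewer GPU steps.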
Key-Value (KV) Caching
In Transformer models, generating each new token requires attending to all preceding tokens. To avoid recomputing attention over the entire context at every step, which would make generation cost grow quadratically with sequence length, Key-Value Caching (KV Cache) is used: the system stores the keys and values already computed for the processed context and reuses them, making the generation of long sequences and multi-turn dialogues significantly more efficient[7].
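A single-head sketch in NumPy shows the mechanic (head dimension is illustrative): at each decode step only the new token's key and value are appended, and attention for the new query runs over the cached history:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # head dimension (illustrative)

def attend(q, K, V):
    # Scaled dot-product attention for one new query token.
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(5):                    # decode 5 tokens, one at a time
    q = rng.standard_normal(d)           # stand-ins for projected activations
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    # Append only the new token's key/value instead of recomputing all of them.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)

print(K_cache.shape)  # (5, 16): one cached key per generated token
```

The trade-off is memory: the cache grows linearly with context length, which motivates the attention variants discussed next.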
Attention Mechanism Optimization
Storing the KV cache requires a significant amount of memory. To reduce it, optimized variants of the attention mechanism have been developed:
- Multi-Query Attention (MQA): All attention heads share a single set of keys and values.
- Grouped-Query Attention (GQA): An intermediate compromise where attention heads are divided into groups, and each group shares a common set of keys and values.
Meta successfully applied GQA in the LLaMA 2 models, which significantly increased inference efficiency when working with long contexts without a substantial loss in quality[10].
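The memory impact is easy to quantify. The configuration below is illustrative of a 70B-class model (fp16 cache assumed); the exact figures are not from any vendor specification:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    # 2x for keys plus values; fp16 storage assumed (2 bytes per value).
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

# Illustrative 70B-class shape: 80 layers, 64 query heads, head_dim 128.
layers, heads, head_dim, seq = 80, 64, 128, 4096

mha = kv_cache_bytes(layers, heads, head_dim, seq)  # every head keeps its own KV
gqa = kv_cache_bytes(layers, 8, head_dim, seq)      # 8 KV groups (GQA)
mqa = kv_cache_bytes(layers, 1, head_dim, seq)      # one shared KV set (MQA)

print(f"MHA: {mha / 2**30:.1f} GiB")    # 10.0 GiB per 4k-token sequence
print(f"GQA: {gqa / 2**30:.2f} GiB")    # 8x smaller
print(f"MQA: {mqa / 2**30:.3f} GiB")    # 64x smaller
```

Per sequence, GQA with 8 groups shrinks this cache eightfold versus full multi-head attention, which translates directly into larger batch sizes and longer supported contexts on the same hardware.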
Infrastructure and System Architecture Optimization
Hybrid Systems and Retrieval-Augmented Generation (RAG)
The largest and most powerful model is not always required for a given task. A hybrid or cascading approach involves using a small, inexpensive model for simple requests, and only if it fails or for complex tasks is the request rerouted to a large, expensive model.
A specific and highly effective case of this approach is Retrieval-Augmented Generation (RAG). In this architecture, the LLM can be relatively compact, as it uses up-to-date information retrieved from an external knowledge base (e.g., corporate documentation or a search engine) to formulate its response. This not only reduces the requirements for the model's size but also mitigates the problem of hallucinations. Deploying a specialized 70-billion-parameter model with RAG on-premises can be 2–4 times cheaper than using the GPT-4 API in the cloud[11].
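The cascading pattern can be sketched as a simple router. Everything here is hypothetical: the model stubs, the confidence heuristic, and the threshold stand in for a real classifier or a small model's self-reported uncertainty:

```python
def small_model(prompt: str):
    # Stub: pretend the cheap model returns an answer plus a confidence score.
    answer = f"small-model answer to: {prompt}"
    confidence = 0.9 if len(prompt) < 50 else 0.3   # toy heuristic
    return answer, confidence

def large_model(prompt: str) -> str:
    # Stub for the expensive model (e.g., a large API-hosted LLM).
    return f"large-model answer to: {prompt}"

def cascade(prompt: str, threshold: float = 0.7):
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer, "small"           # cheap path handles the request
    return large_model(prompt), "large"  # escalate only when needed

_, route = cascade("What are your opening hours?")
print(route)  # small
_, route = cascade("Summarize this 40-page contract and flag every unusual clause in it")
print(route)  # large
```

In production the routing signal would come from a trained classifier, token-level entropy, or an explicit task taxonomy rather than prompt length, but the cost structure is the same: most traffic never reaches the expensive model.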
References
- "OpenAI's GPT-3 Language Model: A Technical Overview". Lambda Labs. [1]
- "The Energy Footprint of Humans and Large Language Models". Communications of the ACM. [2]
- "The Inference Cost Of Search Disruption - Large Language Model Cost Analysis". SemiAnalysis. [3]
- Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models". arXiv:2203.15556.
- Chow, T. (2024). "Three Kuhnian Revolutions in ML Training". Substack. [4]
- "A Study to Evaluate the Impact of LoRA Fine-tuning on the Performance of Non-functional Requirements Classification" (2025). arXiv:2503.07927.
- "LLM Inference Optimization: How to Speed Up, Cut Costs, and Scale AI Models". deepsense.ai. [5]
- Frantar, E., et al. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers". arXiv:2210.17323.
- "Continuous vs dynamic batching for AI inference". Baseten Blog. [6]
- "What is grouped query attention?". IBM. [7]
- "Inferencing on-premises with Dell Technologies". Dell Technologies Analyst Paper. [8]