Prompt compression
Prompt compression is a set of methods in prompt engineering aimed at reducing the length of the input text (prompt) for large language models (LLMs) while preserving key information[1]. As the context window of LLMs has grown to millions of tokens (e.g., in Google Gemini), the ability to process very long texts has emerged, but this has created new challenges: high inference costs, increased latency, and a decline in reasoning quality due to the "lost in the middle" effect[2].
Prompt compression addresses these problems by concentrating the most essential data into a shortened input and discarding redundant information. This reduces the risk of exceeding the context limit, accelerates generation, and lowers costs, all while maintaining the accuracy of the responses[3].
Methods of Prompt Compression
Prompt compression methods can be divided into several main classes.
Token Dropping (Filtering)
This approach involves removing the least informative tokens, phrases, or sentences from the source text without altering the remaining parts. Token importance is estimated heuristically or with a small auxiliary language model.
- LLMLingua: A method developed by Microsoft that calculates the perplexity of each token and removes those that have little impact on the text's predictability. In the LongLLMLingua version, this approach is adapted for long documents, taking into account the relevance of segments to a specific user query[4].
- Selective-Context: Uses a small language model to evaluate the self-information of each token and discards the tokens with the lowest informativeness[5].
- PCRL (Prompt Compression via Reinforcement Learning): Trains an agent using reinforcement learning to make a "keep" or "drop" decision for each token, with the goal of maximizing a quality metric (e.g., ROUGE) for the final response[6].
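The filtering idea behind Selective-Context and LLMLingua can be sketched with a toy model. The snippet below is a minimal illustration, not any of the published implementations: it scores each token by its unigram self-information, -log2 p(token), estimated from corpus counts, and drops the least surprising tokens (the real methods use a small causal language model's conditional probabilities instead).

```python
import math
from collections import Counter

def compress_by_self_information(prompt: str, corpus: str, keep_ratio: float = 0.6) -> str:
    """Drop the tokens with the lowest self-information, -log2 p(token).

    Toy sketch: p(token) is a unigram estimate from `corpus`; methods like
    Selective-Context use a small causal LM's conditional probabilities.
    """
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    tokens = prompt.split()
    # Self-information of each token; unseen tokens get maximal surprise.
    info = [-math.log2(counts.get(t.lower(), 0.5) / total) for t in tokens]
    k = max(1, int(len(tokens) * keep_ratio))
    # Keep the k most informative tokens, preserving their original order.
    keep = set(sorted(range(len(tokens)), key=lambda i: info[i], reverse=True)[:k])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

corpus = "the cat sat on the mat the dog sat on the rug " * 50 + "quantum entanglement"
print(compress_by_self_information("the quantum cat discovered entanglement on the mat", corpus))
```

Frequent function words ("the", "on") carry little self-information and are dropped first, while rare content words survive.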
Abstractive Compression (Summarization)
In this approach, a compressor model (usually a smaller one) generates a concise, abstractive summary of the source text, which is then passed to the main LLM.
- RECOMP (Retrieve, Compress, Prepend): For each document in a knowledge base, a brief summary is generated in advance, taking anticipated user queries into account (a query-aware summary). This allows not only for compression but also for pre-processing of information[7].
- PRCA (Pluggable Reward-driven Contextual Adapter): Combines training a summarizer model with reinforcement learning to generate summaries that maximally improve the quality of the main LLM's responses[8].
- Prompt-SAW (Semantic Aware Winnowing): Before summarization, it extracts a knowledge graph (entities and relations) from the text, selects the relevant graph nodes, and generates a compressed text based on them[9].
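The "compress offline, prompt online" pattern behind RECOMP can be sketched as follows. Everything here is an illustrative stand-in: the documents, the hand-written summaries, and the `build_prompt` helper are hypothetical, and in RECOMP a trained abstractive compressor produces the summaries rather than a human.

```python
# Sketch of the RECOMP-style pattern: summaries are produced once, offline;
# at inference time the short summary is prompted in place of the full text.

documents = {
    "doc1": "The Eiffel Tower, completed in 1889 for the World's Fair, is 330 m tall...",
    "doc2": "Paris, the capital of France, has a population of about 2.1 million...",
}

# Precomputed query-aware summaries (the one-time offline compression step).
summaries = {
    "doc1": "Eiffel Tower: completed 1889, height 330 m.",
    "doc2": "Paris: capital of France, ~2.1M people.",
}

def build_prompt(query: str, retrieved_ids: list[str]) -> str:
    # Prepend the short summaries instead of the full retrieved documents.
    context = "\n".join(summaries[d] for d in retrieved_ids)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("How tall is the Eiffel Tower?", ["doc1"])
print(prompt)
```

The main LLM sees only the compact summary, so the per-query prompt cost no longer depends on the length of the underlying documents.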
Extractive Compression
This method extracts key fragments (sentences, paragraphs) from the source text without paraphrasing them.
- Reranker-LLMs: Uses a reranker model that scores the importance of each paragraph or document for the current query and keeps only the most relevant ones[10].
- CompAct: Implements iterative extract-and-compress processing. The model takes segments of a long text in sequence, compresses them, and checks whether the result contains enough information for a response. If not, it adds the next segment and compresses again, achieving significant compression while preserving quality[11].
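The iterative loop described above can be sketched in a few lines. This is a toy illustration with assumed components, not the CompAct implementation: relevance scoring is simple word overlap, and the sufficiency check is a keyword-coverage heuristic, where the real method uses an LLM for both compression and the stopping decision.

```python
def relevance(sentence: str, query: str) -> int:
    """Toy relevance score: number of words shared with the query."""
    return len(set(query.lower().split()) & set(sentence.lower().split()))

def iterative_compress(segments: list[str], query: str, budget: int = 16) -> str:
    """Iterative extract-compress-check loop (sketch): add one segment at a
    time, trim to the token budget, stop once the query seems answerable."""
    kept: list[str] = []
    for segment in segments:
        kept.extend(s.strip() for s in segment.split(".") if s.strip())
        # "Compress": re-rank everything kept so far and trim to the budget.
        kept.sort(key=lambda s: relevance(s, query), reverse=True)
        while sum(len(s.split()) for s in kept) > budget and len(kept) > 1:
            kept.pop()  # drop the least relevant sentence
        # Sufficiency check (stand-in): stop if all query keywords are covered.
        if set(query.lower().split()) <= set(" ".join(kept).lower().split()):
            break
    return ". ".join(kept) + "."

segments = [
    "The report covers many topics. Revenue grew in the second quarter.",
    "Revenue in the second quarter reached 4 million dollars. Staff numbers were flat.",
]
print(iterative_compress(segments, "second quarter revenue dollars"))
```

The loop keeps the compressed context within a fixed budget regardless of how many segments are consumed, which is what makes the approach viable for very long documents.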
Distillation and "Memory Tokens"
A new class of methods where, instead of text, the model receives specially trained placeholder tokens or embeddings that contain compressed information.
- Gist Tokens: An LLM is fine-tuned to "distill" long instructions into a small set of special gist tokens (e.g., 20-30 tokens instead of several thousand). These tokens are then used in place of the original prompt, providing up to 26x compression with minimal loss in quality[12].
- Soft Prompt Tuning: Instead of a text-based prompt, trainable "virtual tokens" (embeddings) are used, which are tuned to solve a specific task.
- SelfCP: Proposes using the frozen LLM itself as the compressor. By feeding it a text segment with special markers, the model generates a dense representation (memory tokens), which is then used by the same model to generate a response[13].
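The interface of gist tokens and soft prompts can be sketched at the embedding level: a handful of learned vectors is prepended to the input in place of the instruction's token embeddings. Everything below is an illustrative assumption (the dimensions, the random stand-in embedder, and the "trained" gist vectors); in the real methods the gist vectors are learned end to end so that attending to them approximates attending to the full instruction.

```python
import random

random.seed(0)
DIM = 8          # toy embedding dimension
NUM_GIST = 4     # number of gist tokens standing in for the instruction

# Stand-in for trained gist vectors (learned end to end in the real method).
gist_vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_GIST)]

def embed(tokens: list[str]) -> list[list[float]]:
    """Stand-in token embedder (deterministic per token)."""
    vecs = []
    for t in tokens:
        rng = random.Random(t)
        vecs.append([rng.gauss(0, 1) for _ in range(DIM)])
    return vecs

instruction = ("You are a careful assistant. Always answer in formal English, "
               "cite sources, refuse unsafe requests").split()  # imagine thousands of tokens
task_input = "Summarise the attached report.".split()

# Uncompressed input: instruction embeddings followed by the task input.
full_sequence = embed(instruction) + embed(task_input)
# Gist-compressed input: a few learned vectors replace the whole instruction.
gist_sequence = gist_vectors + embed(task_input)

print(f"full: {len(full_sequence)} vectors, gist: {len(gist_sequence)} vectors, "
      f"instruction compression: {len(instruction) / NUM_GIST:.1f}x")
```

Because the model consumes embeddings, not text, nothing downstream needs to change: the gist vectors simply occupy far fewer positions in the sequence than the instruction they replace.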
Efficiency and Trade-offs
- Acceleration and Cost Reduction: Since the cost of transformer self-attention grows quadratically ($O(n^2)$) with sequence length, reducing the prompt severalfold yields significant savings. For example, gist tokens with 26x compression demonstrate up to 40% savings in FLOPs[12].
- Quality Improvement: Sometimes, prompt compression can even improve the quality of responses if the original text contained noise or distracting details. Removing irrelevant context helps the model focus better on the important aspects of the task.
- Quality Trade-off (Faithfulness): Overly aggressive compression can lead to the loss of important details (dates, names, negations), which degrades the response quality. Abstractive methods are particularly susceptible to the risk of hallucinations. Ensuring the completeness and accuracy (faithfulness) of the compressed prompt is a key challenge.
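The cost arithmetic can be made concrete with a toy per-layer FLOPs model. The constants and the hidden size below are illustrative assumptions: the point is only that the attention term scales quadratically in sequence length while the feed-forward term scales linearly, so shrinking the prompt shrinks the attention cost disproportionately.

```python
def attention_flops(n: int, d: int = 4096) -> float:
    """Rough self-attention cost per layer: ~2 * n^2 * d for the score
    matrix and the weighted sum (constants are illustrative)."""
    return 2 * n * n * d

def mlp_flops(n: int, d: int = 4096) -> float:
    """Rough feed-forward cost per layer: ~8 * n * d^2 (4x hidden width)."""
    return 8 * n * d * d

for n in (32_000, 32_000 // 26):   # original prompt vs 26x-compressed prompt
    total = attention_flops(n) + mlp_flops(n)
    print(f"n={n:>6}: attention {attention_flops(n):.2e}, "
          f"mlp {mlp_flops(n):.2e}, total {total:.2e}")

ratio = ((attention_flops(32_000) + mlp_flops(32_000))
         / (attention_flops(32_000 // 26) + mlp_flops(32_000 // 26)))
print(f"prefill speedup from 26x compression ≈ {ratio:.0f}x")
```

At very long prompt lengths the quadratic attention term dominates, so the prefill speedup can exceed the compression ratio itself; end-to-end savings are smaller because decoding cost is unchanged, which is consistent with the ~40% FLOPs figure reported for gist tokens.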
Relation to Other Areas
- Retrieval-Augmented Generation (RAG): RAG and prompt compression are closely related. RAG can be seen as an external compression step: instead of processing an entire database, relevant documents are retrieved and selected. Prompt compression complements RAG by reducing the volume of the retrieved documents before they are fed into the LLM.
- In-Context Learning: In-context examples (demonstrations) significantly increase the prompt length. Compressing these demonstrations (e.g., using Instruction Distillation, where multiple examples are replaced by a single short instruction) is an active area of research.
Literature
- Ali, M. et al. (2024). Prompt-SAW: Semantic-Aware Winnowing for Prompt Compression. arXiv preprint.
- Gao, J.; Cao, Z.; Li, W. (2024). SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself. arXiv:2405.17052.
- Jha, S. et al. (2024). Characterizing Prompt Compression Methods for Long Context Inference. arXiv:2407.08892.
- Jiang, H. et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. arXiv:2310.05736.
- Jiang, H. et al. (2023). LongLLMLingua: Accelerating and Enhancing LLMs in Long-Context Scenarios via Prompt Compression. arXiv:2310.06839.
- Jung, H.; Kim, K. (2023). Discrete Prompt Compression with Reinforcement Learning. arXiv:2308.08758.
- Li, Y. et al. (2023). Compressing Context to Enhance Inference Efficiency of Large Language Models. EMNLP 2023.
- Mu, J. et al. (2023). Learning to Compress Prompts with Gist Tokens. NeurIPS 2023.
- Xu, F.; Shi, W.; Choi, E. (2023). RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation. arXiv:2310.04408.
- Yang, H. et al. (2023). PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter. EMNLP 2023.
- Yoon, C. et al. (2024). CompAct: Compressing Retrieved Documents Actively for Question Answering. EMNLP 2024.
- Zhang, S. et al. (2024). Efficient Prompting Methods for Large Language Models: A Survey. arXiv:2404.01077.
Notes
1. Jha, S., et al. (2024). "Characterizing Prompt Compression Methods for Long Context Inference". arXiv:2407.08892.
2. Zhang, S., et al. (2024). "Efficient Prompting Methods for Large Language Models: A Survey". arXiv:2404.01077.
3. "Prompt Compression: A Guide With Python Examples". DataCamp.
4. Jiang, H., et al. (2023). "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models". arXiv:2310.05736.
5. Li, Y., et al. (2023). "Compressing Context to Enhance Inference Efficiency of Large Language Models". EMNLP 2023.
6. Jung, H.; Kim, K. (2023). "Discrete Prompt Compression with Reinforcement Learning". arXiv:2308.08758.
7. Xu, F.; Shi, W.; Choi, E. (2023). "RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation". arXiv:2310.04408.
8. Yang, H., et al. (2023). "PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter". EMNLP 2023.
9. Ali, M., et al. (2024). "Prompt-SAW: Semantic-Aware Winnowing for Prompt Compression". arXiv preprint.
10. Pradeep, R., et al. (2023). "How to select the best passages for RAG?". arXiv preprint.
11. Yoon, C., et al. (2024). "CompAct: Compressing Retrieved Documents Actively for Question Answering". EMNLP 2024.
12. Mu, J., et al. (2023). "Learning to Compress Prompts with Gist Tokens". NeurIPS 2023.
13. Gao, J.; Cao, Z.; Li, W. (2024). "SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself". arXiv:2405.17052.