Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Abstract
Incorporating external knowledge into large language models (LLMs) enhances their utility across diverse applications, but existing methods involve trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compact representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad-knowledge tasks.
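For readers who want a concrete picture of the mechanism, here is a minimal PyTorch sketch of the general idea: score each document position in the KV cache by the attention it receives from the task-description and few-shot tokens, then keep only the top fraction. The function name, tensor layout, and sum-of-attention scoring below are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def task_aware_compress(keys, values, task_attn, rate=30):
    """Keep the 1/rate document KV entries most attended to by task tokens.

    Assumed (illustrative) shapes:
      keys, values: [num_heads, doc_len, head_dim] KV cache over the document
      task_attn:    [num_heads, task_len, doc_len] attention weights from the
                    task-description / few-shot tokens to document tokens
    """
    num_heads, doc_len, head_dim = keys.shape
    keep = max(1, doc_len // rate)
    # Score each document position by the attention mass it receives
    # from the task tokens, then keep the top-scoring positions per head.
    scores = task_attn.sum(dim=1)                      # [num_heads, doc_len]
    idx = scores.topk(keep, dim=-1).indices
    idx = idx.sort(dim=-1).values                      # preserve positional order
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)   # [num_heads, keep, head_dim]
    return keys.gather(1, idx), values.gather(1, idx)
```

At `rate=30`, a 30,000-token document cache shrinks to roughly 1,000 entries per head, which is the kind of budget a fair comparison against RAG would also hold retrieval to, and what drives the latency reduction reported above.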
Community
Nice work
I tried the demo on the "Large Language Diffusion Models" arXiv paper (2502.09992v2) with 30x compression and it failed (it started hallucinating about Llama); at 4x compression it worked perfectly.
But in the paper you claim 30x compression can match or surpass RAG performance. Are there other considerations at play to make this work nicely?
Thank you for trying the demo and for your feedback!
When comparing KVCompress with RAG, it's important to evaluate them at the same compression rate. RAG is constrained by retrieval: only a few top-ranked chunks fit into the context, so it often misses information needed for broader reasoning.
Also, our method is designed to identify the most important key-value vectors, guided by the task description and few-shot examples. In the Hugging Face demo, the prompt designer provides default few-shot examples, but these may need to be adjusted depending on the nature of the document you are submitting (in your case, an academic paper) and the level of detail you are interested in.
For processing academic papers, a more effective set of few-shot examples might include questions that emphasize factual extraction, summarization, or key-insight retrieval. Here are some examples you could try (a short sketch of how such examples steer the compression follows the list):
Examples
question: What are the key topics discussed in this document?
answer: The document covers topics such as [summarized key themes from the document].
question: What are the main claims or findings?
answer: The document states that [summary of key claims or findings].
question: What evidence or supporting arguments are provided?
answer: The document presents evidence such as [mention types of evidence used, e.g., experiments, case studies, statistics].
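To make this concrete, here is a small, hypothetical helper showing how such few-shot pairs could be assembled into the guidance text whose attention over the document scores the KV entries (as in the sketch above). The prompt format and function name are illustrative; the demo's actual format may differ.

```python
# Hypothetical helper; the demo's actual prompt format may differ.
def build_guidance_prompt(task_description, examples):
    """Concatenate the task description and few-shot Q/A pairs into the
    guidance text whose tokens steer the KV selection."""
    parts = [task_description]
    for question, answer in examples:
        parts.append(f"question: {question}\nanswer: {answer}")
    return "\n".join(parts)

examples = [
    ("What are the key topics discussed in this document?",
     "The document covers topics such as [summarized key themes]."),
    ("What are the main claims or findings?",
     "The document states that [summary of key claims or findings]."),
]
prompt = build_guidance_prompt(
    "Answer factual questions about this academic paper.", examples
)
```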
Let me know how it goes!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference (2025)
- Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs (2025)
- Can LLMs Maintain Fundamental Abilities under KV Cache Compression? (2025)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference (2025)
- Long-Context Inference with Retrieval-Augmented Speculative Decoding (2025)
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse (2025)
- Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs (2025)