Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Abstract
Incorporating external knowledge into large language models (LLMs) enhances their utility across diverse applications, but existing methods involve trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compact representation of all relevant information. Experiments show our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad-knowledge tasks.
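For readers who want a concrete picture of the mechanism, here is a minimal PyTorch sketch of the general idea: score each document position in the KV cache by the attention it receives from the task-description and few-shot tokens, then keep only the top fraction. The function name, tensor layout, and sum-of-attention scoring below are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def task_aware_compress(keys, values, task_attn, rate=30):
    """Keep the 1/rate document KV entries most attended to by task tokens.

    Assumed (illustrative) shapes:
      keys, values: [num_heads, doc_len, head_dim] KV cache over the document
      task_attn:    [num_heads, task_len, doc_len] attention weights from the
                    task-description / few-shot tokens to document tokens
    """
    num_heads, doc_len, head_dim = keys.shape
    keep = max(1, doc_len // rate)
    # Score each document position by the attention mass it receives
    # from the task tokens, then keep the top-scoring positions per head.
    scores = task_attn.sum(dim=1)                      # [num_heads, doc_len]
    idx = scores.topk(keep, dim=-1).indices
    idx = idx.sort(dim=-1).values                      # preserve positional order
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)   # [num_heads, keep, head_dim]
    return keys.gather(1, idx), values.gather(1, idx)
```

At `rate=30`, a 30,000-token document cache shrinks to roughly 1,000 entries per head, which is the kind of budget a fair comparison against RAG would also hold retrieval to, and what drives the latency reduction reported above.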
Community
Nice work
I tried the demo on the "Large Language Diffusion Models" arXiv paper (2502.09992v2) with 30x compression and it failed (it started hallucinating about Llama); at 4x compression it worked perfectly.
But in the paper you claim 30x compression can match or surpass RAG performance. Are there other considerations at play to make this work nicely?
Thank you for trying the demo and for your feedback!
When comparing KVCompress with RAG, it's important to evaluate them at the same compression rate. RAG is constrained by retrieval: only a few top-ranked chunks fit into the context, so it often misses information needed for broader reasoning.
Also, our method is designed to identify the most important key-value vectors, guided by the task description and few-shot examples. In the Hugging Face demo, the prompt designer provides default few-shot examples, but these may need to be adjusted depending on the nature of the document you are submitting (in your case, an academic paper) and the level of detail you are interested in.
For processing academic papers, a more effective set of few-shot examples might include questions that emphasize factual extraction, summarization, or key-insight retrieval. Here are some examples you could try (a short sketch of how such examples steer the compression follows the list):
Examples
question: What are the key topics discussed in this document?
answer: The document covers topics such as [summarized key themes from the document].
question: What are the main claims or findings?
answer: The document states that [summary of key claims or findings].
question: What evidence or supporting arguments are provided?
answer: The document presents evidence such as [mention types of evidence used, e.g., experiments, case studies, statistics].
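To make this concrete, here is a small, hypothetical helper showing how such few-shot pairs could be assembled into the guidance text whose attention over the document scores the KV entries (as in the sketch above). The prompt format and function name are illustrative; the demo's actual format may differ.

```python
# Hypothetical helper; the demo's actual prompt format may differ.
def build_guidance_prompt(task_description, examples):
    """Concatenate the task description and few-shot Q/A pairs into the
    guidance text whose tokens steer the KV selection."""
    parts = [task_description]
    for question, answer in examples:
        parts.append(f"question: {question}\nanswer: {answer}")
    return "\n".join(parts)

examples = [
    ("What are the key topics discussed in this document?",
     "The document covers topics such as [summarized key themes]."),
    ("What are the main claims or findings?",
     "The document states that [summary of key claims or findings]."),
]
prompt = build_guidance_prompt(
    "Answer factual questions about this academic paper.", examples
)
```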
Let me know how it goes!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference (2025)
- Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs (2025)
- Can LLMs Maintain Fundamental Abilities under KV Cache Compression? (2025)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference (2025)
- Long-Context Inference with Retrieval-Augmented Speculative Decoding (2025)
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse (2025)
- Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs (2025)