KV Caching Explained: Optimizing Transformer Inference Efficiency

Community Article Published January 30, 2025

Introduction

When AI models generate text, they often repeat many of the same calculations, which can slow things down. Key-Value caching is a technique that helps speed up this process by remembering important information from previous steps. Instead of recomputing everything from scratch, the model reuses what it has already calculated, making text generation much faster and more efficient.

In this blog post, we’ll break down KV caching in an easy-to-understand way, explain why it’s useful, and show how it helps AI models work faster.

Prerequisites

To fully grasp the content, readers should be familiar with:

  1. Transformer Architecture: Familiarity with components such as the attention mechanism.
  2. Autoregressive Modeling: Understanding of how models like GPT generate sequences.
  3. Linear Algebra Basics: Concepts like matrix multiplication and transposition, which are essential for understanding attention computations.

This 👉 BLOG should cover most of the prerequisites needed for this article.

Here are some of the most essential takeaways:
  • The attention weights have a shape of [batch, h, seq_len, seq_len].
  • Masked multi-head attention lets each token attend to itself and all the tokens before it.
  • To generate a new token, the model needs the representations of all previous tokens, each of which was computed by attending to its own preceding tokens.
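
As a quick sanity check of the first takeaway, here is a minimal PyTorch sketch (all dimensions below are made up for illustration) that builds causally masked attention weights and prints their shape:

import torch
import torch.nn.functional as F

# Made-up dimensions for illustration only
batch, h, seq_len, d_k = 2, 4, 6, 5

q = torch.randn(batch, h, seq_len, d_k)
k = torch.randn(batch, h, seq_len, d_k)

# Scaled dot-product scores, then a causal mask so each token
# only attends to itself and the tokens before it
scores = q @ k.transpose(-2, -1) / d_k**0.5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights.shape)  # torch.Size([2, 4, 6, 6]) -> [batch, h, seq_len, seq_len]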

Standard Inference and the Rise of KV Caching

When a model generates text, it looks at all the previous tokens to predict the next one. Normally, it would repeat the same calculations for every new token, which can slow things down.

KV caching eliminates this redundant computation by remembering the results from previous steps: the intermediate key and value states of the attention layers are stored during inference and reused for every new token.

How Does KV Caching Work?

Step-by-Step Process

  1. First Generation: When the model processes the initial input, it computes the keys and values for every token and stores them in the cache.
  2. Next Tokens: For each new token, the model computes only that token’s key and value and appends them to the cache instead of starting over.
  3. Efficient Attention Computation: Attention is computed using the cached K and V along with the new query Q to produce the output (see the code sketch after the figure below).
  4. Update Input: The newly generated token is appended to the input, and the process goes back to step 2 until generation is finished.

The process is illustrated below:

Token 1: [K1, V1] → Cache: [K1, V1]
Token 2: [K2, V2] → Cache: [K1, K2], [V1, V2]
...
Token n: [Kn, Vn] → Cache: [K1, K2, ..., Kn], [V1, V2, ..., Vn]
[Figure: KV caching vs. standard inference]

In the figure above we used d_k = 5 for easier visualization; note that in practice this dimension is typically much larger.
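
To make steps 2 and 3 above concrete, here is a minimal sketch of a single cached decoding step. It uses a single head and d_k = 5 to match the figure, and the projection matrices are random placeholders rather than real model weights:

import torch
import torch.nn.functional as F

# Made-up setup: one head, d_k = 5, and a cache that already holds 3 tokens
batch, d_k = 1, 5
k_cache = torch.randn(batch, 3, d_k)   # cached keys   [K1, K2, K3]
v_cache = torch.randn(batch, 3, d_k)   # cached values [V1, V2, V3]

# Placeholder projection weights (a real model would use trained nn.Linear layers)
w_q, w_k, w_v = (torch.randn(d_k, d_k) for _ in range(3))

# Step 2: project only the new token and append its key/value to the cache
x_new = torch.randn(batch, 1, d_k)                   # hidden state of the new token
q_new = x_new @ w_q                                  # [batch, 1, d_k]
k_cache = torch.cat([k_cache, x_new @ w_k], dim=1)   # [batch, 4, d_k]
v_cache = torch.cat([v_cache, x_new @ w_v], dim=1)   # [batch, 4, d_k]

# Step 3: attention uses only the new query against the full cached K and V
out = F.scaled_dot_product_attention(q_new, k_cache, v_cache)
print(out.shape)  # torch.Size([1, 1, 5]) -> one output row for the new token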

Comparison: KV Caching vs. Standard Inference

Here’s how KV caching compares to standard inference:

| Feature | Standard Inference | KV Caching |
|---|---|---|
| Computation per token | The model repeats the same calculations for every token. | The model reuses past calculations for faster results. |
| Memory usage | Uses less memory at each step, but memory grows with longer texts. | Uses extra memory to store past keys and values, but keeps generation efficient. |
| Speed | Gets slower as the text gets longer because it repeats work. | Stays fast even with longer texts by avoiding repeated work. |
| Efficiency | High computational cost and slower response times. | Faster and more efficient since the model remembers past work. |
| Handling long texts | Struggles with long texts due to repeated calculations. | Well suited for long texts as it remembers past steps. |

KV caching makes a big difference in speed and efficiency, especially for long texts. By saving and reusing past calculations, it avoids the need to start over each time, making it much faster than the regular way of generating text.
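
To put a rough number on that extra memory, here is a back-of-the-envelope sketch. The configuration below is hypothetical (not tied to any specific model), and the formula ignores framework overhead:

# Hypothetical decoder configuration, for illustration only
num_layers = 24
num_kv_heads = 8
head_dim = 128
seq_len = 4096
bytes_per_value = 2  # fp16 / bf16

# Both K and V are cached for every layer, head, and token
cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
print(f"{cache_bytes / 1024**3:.2f} GiB")  # ~0.38 GiB for this configuration

Even this modest configuration needs a few hundred megabytes at a 4k context, which is why the memory column in the table above matters in practice.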

Practical Implementation

This is a simplified example of implementing KV caching in PyTorch:

# Simplified KV cache in PyTorch
import torch


class KVCache:
    """A minimal per-layer cache that stores keys and values along the sequence dimension."""

    def __init__(self):
        self.cache = {"key": None, "value": None}

    def update(self, key, value):
        # First step: nothing is cached yet, so just store the new tensors
        if self.cache["key"] is None:
            self.cache["key"] = key
            self.cache["value"] = value
        else:
            # Later steps: append the new token's key/value along the sequence dimension
            self.cache["key"] = torch.cat([self.cache["key"], key], dim=1)
            self.cache["value"] = torch.cat([self.cache["value"], value], dim=1)

    def get_cache(self):
        return self.cache
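
A possible way to exercise this cache in a toy decoding loop, continuing from the class above (the shapes and projection weights are illustrative only):

cache = KVCache()
d_k = 5
w_k, w_v = torch.randn(d_k, d_k), torch.randn(d_k, d_k)

# Feed tokens one at a time; each step only projects the newest token
for step in range(4):
    x_new = torch.randn(1, 1, d_k)          # [batch, 1, d_k] hidden state of the new token
    cache.update(x_new @ w_k, x_new @ w_v)  # append this token's K/V along dim=1

print(cache.get_cache()["key"].shape)    # torch.Size([1, 4, 5])
print(cache.get_cache()["value"].shape)  # torch.Size([1, 4, 5])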

When using the transformers library, this behavior is enabled by default through the use_cache parameter. You can also select between several caching strategies through the cache_implementation parameter. Here's a minimal example:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('HuggingFaceTB/SmolLM2-1.7B')
model = AutoModelForCausalLM.from_pretrained('HuggingFaceTB/SmolLM2-1.7B').cuda()

tokens = tokenizer.encode("The red cat was", return_tensors="pt").cuda()
output = model.generate(
    tokens, max_new_tokens=300, use_cache=True  # use_cache is True by default
)
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
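
If you want to experiment with the cache_implementation parameter mentioned above, recent versions of transformers let you pass it directly to generate; "static" is one of the documented options (availability depends on your installed version):

# Pre-allocates the cache to a fixed size; mainly useful together with torch.compile
output = model.generate(
    tokens,
    max_new_tokens=300,
    cache_implementation="static",
)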

We benchmarked the code above with and without KV caching on a T4 GPU and got the following results:

| With KV Caching | Standard Inference | Speedup |
|---|---|---|
| 11.7 s | 1 min 1 s | ~5.21× faster |
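
The exact numbers will vary with hardware and library versions, but a comparable measurement can be reproduced with a sketch like the following, which reuses the model and tokens from the snippet above and simply toggles use_cache:

import time
import torch

def timed_generate(use_cache):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(tokens, max_new_tokens=300, use_cache=use_cache)
    torch.cuda.synchronize()
    return time.perf_counter() - start

with_cache = timed_generate(use_cache=True)
without_cache = timed_generate(use_cache=False)
print(f"with cache: {with_cache:.1f}s, without: {without_cache:.1f}s, "
      f"speedup: {without_cache / with_cache:.2f}x")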

Conclusion

KV caching is a simple but powerful technique that helps AI models generate text faster and more efficiently. By remembering past calculations instead of repeating them, it reduces the time and effort needed to predict new tokens. While it does require extra memory, this method is especially useful for long conversations, ensuring fast and efficient generation.

Understanding KV caching can help developers and AI enthusiasts build faster, smarter, and more scalable language models for real-world applications.

I would like to extend my sincerest gratitude to Aritra Roy Gosthipaty 🤗 for his invaluable support, feedback, and dedication in developing this blog post.

References & Further Reading

  1. Transformers KV Caching Explained
  2. Transformers Key-Value Caching Explained
  3. Mastering LLM Techniques: Inference Optimization
  4. Hugging Face Documentation - KV Caching in Transformers
