AI Interview Series #4: Define KV Caching

Question:
You are deploying an LLM in production. Generating the first few tokens is fast, but as the sequence grows, each additional token takes a little longer to produce, even though the model architecture and hardware remain the same.
If raw compute is not the main bottleneck, what inefficiencies are causing this slowdown, and how can you redesign the token generation process to make it much faster?
What is KV Caching and how does it speed up token generation?
KV caching is an optimization technique used during text generation in large language models to avoid unnecessary computation. In autoregressive generation, the model generates text one token at a time, and at each step it attends to all previous tokens. However, the keys (K) and values (V) computed for those previous tokens do not change from step to step.
With KV caching, the model saves these keys and values when they are computed for the first time. When it generates the next token, it reuses the stored K and V instead of recomputing them from scratch, computing only the query (Q), key, and value for the new token. Attention is then calculated over the cached keys and values together with the new token's query.
This reuse of previous computations greatly reduces redundant work, making generation faster and more efficient, especially for long sequences, at the expense of additional memory to hold the cache. Check out Practice Notebook here
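To make the mechanism concrete, here is a minimal NumPy sketch of single-head cached attention. The projection matrices, hidden size, and token embeddings are all illustrative (random values, not from any real model); the point is that the cached and uncached paths produce identical attention outputs while the cached path computes each token's K and V exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

# Hypothetical projection matrices for Q, K, V (random for this sketch).
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # Scaled dot-product attention of one query over all keys/values.
    scores = q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

# Simulate autoregressive generation over 5 token embeddings.
tokens = rng.standard_normal((5, d))

# --- With KV caching: each token's K and V are computed exactly once. ---
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for x in tokens:
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])  # append once, reuse forever
    V_cache = np.vstack([V_cache, v])
    cached_outputs.append(attend(q, K_cache, V_cache))

# --- Without caching: recompute K and V for the whole prefix each step. ---
uncached_outputs = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    K, V = prefix @ Wk, prefix @ Wv  # redundant work at every step
    q = tokens[t - 1] @ Wq
    uncached_outputs.append(attend(q, K, V))

# Both paths give identical attention outputs.
print(np.allclose(cached_outputs, uncached_outputs))  # True
```

The cache only changes how much work is done, never the result, which is why it is a pure inference-time optimization.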

Examining the Impact of KV Caching on Inference Speed
In this code, we measure the effect of KV caching during autoregressive text generation. We generate from the same prompt five times with KV caching enabled and five times without it, and measure the average generation time. By keeping the model, prompt, and generation length constant, this experiment isolates how reusing cached keys and values eliminates redundant attention computation and speeds up inference.
import time

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = "Explain KV caching in transformers."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

for use_cache in (True, False):
    times = []
    for _ in range(5):
        start = time.time()
        model.generate(
            **inputs,
            use_cache=use_cache,
            max_new_tokens=1000,
        )
        times.append(time.time() - start)
    print(
        f"{'with' if use_cache else 'without'} KV caching: "
        f"{round(np.mean(times), 3)} ± {round(np.std(times), 3)} seconds"
    )


The results clearly show the effect of KV caching on inference speed. With KV caching enabled, generating 1000 tokens takes about 21.7 seconds, while disabling it increases the generation time to over 107 seconds, nearly a 5× slowdown. This sharp difference occurs because, without the KV cache, the model recomputes the keys and values of all previously generated tokens at every step, so total computation grows quadratically with sequence length.
With KV caching, past keys and values are reused, eliminating redundant work and keeping per-token generation time nearly constant as the sequence grows. This experiment highlights why KV caching is essential for efficient, real-world deployment of autoregressive language models.
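The quadratic-versus-linear difference can be checked with simple arithmetic. This sketch counts how many per-token K/V projections each strategy performs for a hypothetical 1000-token generation (a back-of-the-envelope count, not a benchmark):

```python
# Number of tokens to generate (illustrative).
n = 1000

# Without caching, step t recomputes K and V for all t tokens in the
# prefix, so the total count is 1 + 2 + ... + n = n(n + 1) / 2.
without_cache = n * (n + 1) // 2

# With caching, each token's K and V are computed exactly once.
with_cache = n

print(without_cache)               # 500500
print(with_cache)                  # 1000
print(without_cache / with_cache)  # 500.5
```

The measured speedup (about 5×, not 500×) is smaller than this ratio because generation time also includes work the cache cannot remove, such as the feed-forward layers and the attention score computation itself.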

I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in Data Science, especially Neural Networks and their applications in various fields.



