
Prompt Caching with the OpenAI API: A Full Hands-On Python Tutorial

In my previous post, I covered Prompt Caching – what it is, how it works, and how it can save you a lot of money and time when running high-traffic AI-powered applications. In today's post, I walk you through using Prompt Caching specifically with OpenAI's API, and we discuss some common pitfalls.


A brief reminder of Prompt Caching

Before getting our hands dirty, let's briefly revisit what Prompt Caching is. Prompt Caching is a feature provided by frontier model API services such as OpenAI's API or Claude's API, which allows frequently repeated parts of the LLM input to be cached and reused. Such repeated components are typically system prompts or instructions that are passed to the model every time the AI application is used, in contrast to content that is unique per request, such as a user query or information retrieved from a knowledge base. For a request to hit the cache, the repeated part of the prompt must sit at its very beginning – that is, it must form a common prefix. Additionally, for caching to kick in, this prefix must exceed a certain length threshold (e.g., in the OpenAI API the prefix must be at least 1,024 tokens, while Claude has different minimum cacheable lengths for different models). With both conditions satisfied – the repeated tokens forming a prefix, and that prefix exceeding the size limit defined by the API service and model – caching can kick in and deliver real economies of scale in AI applications.

Unlike caching of other components in RAG pipelines or other AI applications, prompt caching works at the token level, inside the LLM's internal computation. In particular, LLM inference takes place in two phases:

  • Prefill, where the LLM processes the entire input prompt in order to produce the first output token, and
  • Decoding, where the LLM iteratively generates the output tokens one by one.

In short, prompt caching stores the computation performed in the prefill phase, so the model does not need to redo it if the same prefix appears again. Computation that occurs in the decoding phase, even if it is repeated, is never cached.
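To make the prefix rule concrete, here is a minimal, self-contained sketch – note that the whitespace "tokenizer" and the shared_prefix_len helper are toy stand-ins of my own, not anything from the API:

```python
# Toy sketch of the prefix rule: a whitespace split stands in for the real
# tokenizer, just to show that only the part of two prompts that is
# identical from the very first token onward can ever be cached.
def shared_prefix_len(tokens_a, tokens_b):
    n = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        n += 1
    return n

SYSTEM = "You are a helpful ML assistant. Answer in detail.".split()
req1 = SYSTEM + "What is overfitting?".split()
req2 = SYSTEM + "What is regularization?".split()
req3 = "user_42:".split() + SYSTEM  # dynamic data placed *before* the prefix

print(shared_prefix_len(req1, req2))  # the whole system prompt is shared
print(shared_prefix_len(req1, req3))  # 0 -> nothing cacheable
```

Note how prepending even one varying token (req3) destroys the shared prefix entirely – a point we will return to in the pitfalls section.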

For the rest of the post, I will be focusing only on the implementation of caching in the OpenAI API.


What about the OpenAI API?

In the OpenAI API, prompt caching was first introduced on October 1, 2024. Initially, it offered a 50% discount on cached tokens; today, this discount goes up to 90%. On top of that, hitting the prompt cache can also cut latency by up to 80%.

When caching is enabled, the API service tries to route each incoming request to the machine where the relevant cache is expected to live. This is called cache routing, and to do it, the API service typically uses a hash of roughly the first 256 tokens of the prompt.

Apart from this, the API also allows us to explicitly set the prompt_cache_key parameter in the API request. This is a key identifying which cache we are referring to, and it is meant to increase the chance of our request being routed to the correct machine and hitting the cache.

In addition, the OpenAI API provides two types of caching with respect to retention, controlled by the prompt_cache_retention parameter. Those are:

  • In-memory caching: This is the default type, available in all models that support prompt caching. With in-memory caching, the cached prefix remains active for 5-10 minutes between requests.
  • Extended caching: This is available only on some models, and allows keeping the cached prefix alive for up to 24 hours.

Now, as to what all this costs: OpenAI charges the same price per (uncached) input token regardless of whether caching is in play or not. If we manage to hit the cache, the cached tokens are billed at a deeply discounted price, up to 90% off. In addition, the price per input token remains the same for both in-memory and extended caching.
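As a quick back-of-the-envelope sketch (the per-token price below is a made-up placeholder, and the 90% figure is the maximum discount mentioned above):

```python
# Effective input cost with prompt caching; price_per_token is illustrative.
def input_cost(total_tokens, cached_tokens, price_per_token, cached_discount=0.90):
    uncached = total_tokens - cached_tokens
    # Uncached tokens at full price, cached tokens at the discounted price
    return uncached * price_per_token + cached_tokens * price_per_token * (1 - cached_discount)

full = input_cost(4616, 0, 1e-6)    # cold request: nothing cached yet
hit = input_cost(4616, 4608, 1e-6)  # warm request: most of the input cached
print(full, hit)                    # the warm request costs roughly 10x less
```

With almost the entire input cached, the effective input cost approaches the full 90% discount.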


Prompt Caching in Action

So, let's see how prompt caching actually works, with a simple Python example using OpenAI's API. Specifically, we will create a realistic scenario where a long system prompt (the prefix) is reused across many requests. If you are here, I assume you already have your OpenAI API key and have installed the required libraries. So, the first thing to do is import the OpenAI library (and time, to measure latency) and create an OpenAI client instance:

from openai import OpenAI
import time

client = OpenAI(api_key="your_api_key_here")

Then we can define our prefix (the tokens that will be repeated and that we intend to keep cached):

long_prefix = """
You are a highly knowledgeable assistant specialized in machine learning.
Answer questions with detailed, structured explanations, including examples when relevant.

""" * 200  

Notice how we inflate the length (multiplying by 200) to make sure the 1,024-token minimum is exceeded. Then we start a timer to measure latency, and finally we're ready to make our first call:

start = time.time()

response1 = client.responses.create(
    model="gpt-4.1-mini",
    input=long_prefix + "What is overfitting in machine learning?"
)

end = time.time()

print("First response time:", round(end - start, 2), "seconds")
print(response1.output[0].content[0].text)

So, what do we expect to happen here? On gpt-4o and later models, prompt caching is enabled by default, and since our 4,616 input tokens are well above the 1,024-token minimum, we're good to go. What this request does is first check whether the input is a cache hit (it's not, since this is the first time we're making a request with this prefix); on a miss, the full input is processed and its prefix is cached. The next time we send input that starts with the cached prefix, we will get a cache hit. Let's test this in practice by making a second request with the same prefix:

start = time.time()

response2 = client.responses.create(
    model="gpt-4.1-mini",
    input=long_prefix + "What is regularization?"
)

end = time.time()

print("Second response time:", round(end - start, 2), "seconds")
print(response2.output[0].content[0].text)

Indeed! The second request runs much faster (15.37 vs 23.31 seconds). This is because the model has already done the computation for the cached prefix and only needs to process the new part, “What is regularization?”. As a result, by using prompt caching we get much lower latency and reduced cost, since the cached tokens are billed at a discount.
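Beyond wall-clock time, we can also verify the hit from the response's usage metadata. A sketch below – I'm assuming the Responses API usage shape (usage.input_tokens_details.cached_tokens), so double-check the field names against your SDK version; the stub object merely stands in for a live response:

```python
from types import SimpleNamespace

def cached_fraction(usage):
    """Fraction of input tokens that were served from the prompt cache."""
    return usage.input_tokens_details.cached_tokens / usage.input_tokens

# With a live call you would inspect: cached_fraction(response2.usage)
# Stub standing in for the usage object of a warm request:
usage = SimpleNamespace(
    input_tokens=4616,
    input_tokens_details=SimpleNamespace(cached_tokens=4608),
)
print(round(cached_fraction(usage), 3))  # close to 1.0 on a good cache hit
```

A cached_tokens value of 0 on the second request is the clearest signal that one of the pitfalls discussed below is breaking your prefix.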


Another thing mentioned in the OpenAI documentation that we already talked about is the prompt_cache_key parameter. In particular, according to the documentation, we can explicitly set the prompt cache key when making a request, and in this way mark which requests should share the same cache. Anyway, I tried to include it in my example by adjusting the request parameters, but I didn't have much luck:

response1 = client.responses.create(
    prompt_cache_key = 'prompt_cache_test1',
    model="gpt-5.1",
    input=long_prefix + "What is overfitting in machine learning?"
)

🤔

It seems that while prompt_cache_key exists at the API level, it is not exposed as a first-class parameter in the Python SDK. In other words, we can't explicitly control cache reuse this way yet – caching remains automatic and best-effort.
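That said, the openai Python SDK does offer a generic extra_body escape hatch that forwards arbitrary fields into the JSON request body, so one possible workaround – a sketch I haven't verified server-side, assuming the API accepts the parameter – would be:

```python
# Hypothetical workaround: pass prompt_cache_key through the SDK's
# extra_body escape hatch, which merges extra fields into the request body.
request_kwargs = {
    "model": "gpt-4.1-mini",
    "input": "<long_prefix + question goes here>",
    "extra_body": {"prompt_cache_key": "prompt_cache_test1"},
}
# response1 = client.responses.create(**request_kwargs)
print(request_kwargs["extra_body"])
```

Whether the server actually honors the key this way is something to confirm against the cached_tokens usage field.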


So, what could go wrong?

Enabling caching and actually hitting the cache seems rather straightforward from what we've said so far. So, what could go wrong, leading to a cache miss? Unfortunately, many things. As straightforward as it looks, prompt caching comes with several requirements that must all hold at once. Missing even one of them will result in a cache miss. But let's take a closer look!

One obvious pitfall is having a prompt below the threshold for activating caching, i.e., fewer than 1,024 tokens. However, this is easily solved – we can always increase the prefix's token count, for instance by repeating it an appropriate number of times, as shown in the example above.

Another pitfall is silently breaking the prefix. In particular, even when we use a consistent system prompt of sufficient size across all requests, we must be especially careful not to break the prefix by adding any varying content at the beginning of the model input, before it. That's a guaranteed way to miss the cache, no matter how long and repetitive the rest of the prompt is. Common suspects for falling into this trap are dynamic data – for example, injecting user IDs or timestamps at the start of the prompt. Therefore, a best practice to follow in all AI application development is that any dynamic content should always go at the end of the prompt – never at the beginning.
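A small sketch of this best practice (the prompt-builder helpers and names below are mine, purely illustrative):

```python
import datetime

SYSTEM_PROMPT = "You are a highly knowledgeable ML assistant. ..."  # long, stable prefix

def bad_prompt(user_id, question):
    # Timestamp and user ID up front change the first tokens on every call -> cache miss
    return f"[{datetime.datetime.now()}] user={user_id}\n{SYSTEM_PROMPT}\n{question}"

def good_prompt(user_id, question):
    # Stable prefix first, dynamic bits last -> the prefix stays cacheable
    return f"{SYSTEM_PROMPT}\n{question}\n[{datetime.datetime.now()}] user={user_id}"

print(good_prompt("u1", "What is overfitting?").startswith(SYSTEM_PROMPT))  # True
print(bad_prompt("u1", "What is overfitting?").startswith(SYSTEM_PROMPT))   # False
```

Only the good_prompt variant keeps a stable prefix across users and sessions, so only it can ever hit the cache.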

Finally, it is worth highlighting that prompt caching only applies to the prefill phase – decoding is never cached. This means that even if we force the model to generate responses that follow a fixed template, starting with the same fixed tokens every time, those output tokens will not be cached, and we will be billed for generating them as usual.

On the other hand, in certain use cases it doesn't really make sense to use caching at all. Such scenarios involve highly dynamic prompts – for example, chatbots with little repetition, one-off requests, or heavily personalized real-time systems.

. . .

On my mind

Prompt caching can greatly improve the performance of AI applications, both in terms of cost and latency. Especially when it comes to scaling AI applications, caching comes in handy for keeping cost and latency at acceptable levels.

In OpenAI's API, caching is enabled automatically, and the price of uncached input tokens is the same whether we hit the cache or not. Therefore, one can only win by embracing prompt caching and aiming to hit the cache on every request, even if some requests miss.

Claude also provides extensive functionality for prompt caching through its API, which we will explore in detail in future posts.

Thanks for reading! 🙂

