Context Windows Is Not Memory: What AI Agent Developers Need to Understand

0 5 7 minutes read

Context Windows Is Not Memory: What AI Agent Developers Need to Understand

In this article, you'll learn why the main content window is not the same thing as the agent's memory, and how techniques like retrieval, compression, and summarization fit into the agent's cognitive stack.

Topics we will cover include:

Why does the context window behave like a stateless scratchpad rather than persistent memory.
Retrieval-enhanced generation, compression, and compression each play a different role in controlling the input on the scratchpad.
The way agents can achieve real memory persistence is by acting as a database administrator instead of the database itself.

Introduction

Context windows are a key feature of modern AI models, especially language models, where these models can take into account and use a limited amount of input and previous conversation – usually measured as a number of tokens – at the same time when generating an answer.

When an AI lab releases a model with a core window of 2 million tokens, it's not surprising that some developers think: “Let's speed up the entire codebase! Memory problems are fixed!” However, there is a caveat. Taking a large context window as “memory”, in architectural terms, is like buying a 25-foot-wide office desk because you hesitate to find a filing cabinet. Sure, you can have all your documents placed in front of you, but as soon as the working time ends, all the desk documents are removed (by cleaning the staff!).

To clarify these differences and clarify other related concepts, this article provides a conceptual breakdown of the many layers in the cognitive stack of AI agents. We will use a few metaphors, mostly related to the office to better understand these concepts.

Content Window

The context window in the AI model, especially based on agents with underlying language models, is like a desk space or a shapeless scratchpad. It is important to note that models are inherently incomplete. Regardless, every API call to a model starts at “step zero”.

When you pass the agent a chat history that includes 200K tokens (a large context window), it doesn't remember what happened at the previous step in time. Instead, it quickly relearns its “universe” from scratch in a matter of milliseconds. Over time, relying on this strategy in agent-based environments can introduce several dangerous (if not fatal) pitfalls:

AI models act like a lazy reader, paying more attention to the first and last parts of the main message (text), but glossing over ideas and facts buried deep in the middle parts.
There is a snowballing effect: as the conversation progresses, the agent has to resubmit and reread the entire history at each step, including previous, often useless curves.
In terms of delay, there is a “brain freeze” effect, so that against a large wall of text, the model will take some time until it starts generating the very first word in its response.

To do this in concrete, consider what a single API call actually looks like under the hood. Because the model does not hold memory between calls, each previous iteration must be fully captured just to ask one new query:

model.generate(messages=[ {“role”: “user”, “content”: “Step 1: Let’s call this variable `session_id`.”}, {“role”: “assistant”, “content”: “Got it, I’ll use `session_id` going forward.”}, # … every intervening turn must be resent, every single time … {“role”: “user”, “content”: “Step 47: What variable name did we agree on back in step 1?”} ])

model.produce(

messages=[

{“role”: “user”, “content”: “Step 1: Let’s call this variable `session_id`.”},

{“role”: “assistant”, “content”: “Got it, I’ll use `session_id` going forward.”},

# … every intervening turn must be resent, every single time …

{“role”: “user”, “content”: “Step 47: What variable name did we agree on back in step 1?”}

]

)

Step 47 alone forces the whole desk – all 46 turns – back to the table, to answer the question about step 1. That is the result of the above-described snowfall, made of concrete.

Retrieval

Retrieval-augmented generation (RAG) systems are like a giant bookshelf in every office room, helping to retrieve static, existing data relevant to the current step in a “Just-in-Time” fashion. RAG systems pull relevant high-K document fragments from the scratchpad (context window) as the user asks a specific query: the documents returned are, of course, those determined to be most semantically relevant to the user's query or prompt.

When agents are on the go, things are not so simple, however, since vector similarity (a type of similarity measure and data representation used in RAG systems) does not necessarily equate to semantic truth in certain situations. For example, suppose a user tells their scheduling agent to move a meeting to Friday, and later says “cancel Thursday, Alice is sick.” A vector search engine may return both statements in the document domain, even though they are contradictory. The agent and its associated language model should be able to act like accountants who are able to decide which statement best reflects current reality.

The naive RAG pipeline simply compiles whatever it finds and leaves the model to guess which command is still running. The most reliable pattern resolves conflicts before they occur, for example by selecting the most recently recorded statement:

returned_parts = [ {“text”: “Move meeting to Friday”, “timestamp”: “2025-01-10T09:00:00”}, {“text”: “Cancel Thursday, Alice is sick”, “timestamp”: “2025-01-12T14:30:00”} ]# Merge conflicting chunks before they reach the relevant_data = max(retrieved_chunks, key=lambda chunk: chunk[“timestamp”])

returned_episodes = [

{“text”: “Move meeting to Friday”, “timestamp”: “2025-01-10T09:00:00”},

{“text”: “Cancel Thursday, Alice is sick”, “timestamp”: “2025-01-12T14:30:00”}

]

# Combine conflicting pieces before they reach the prompt

latest_priority = plural(returned_episodes, the key=lambda part: part[“timestamp”])

That one line of reasoning is the difference between an agent who confidently repeats an old order, and one who knows full well that a meeting has been cancelled.

Pressure

This is easy to understand if you are used to compressing in ZIP files. In the context of agents and language models, this includes algorithmic token reduction: keeping the data key intact, while its physical footprint within the data at a given step is reduced. There are techniques such as stripping syntax, passing the raw text to some compression model such as LLMLingua, or Prompt Caching, to do this. This is, in fact, a bandwidth optimization game to be used in situations like compressing a 15K token JSON load down to 5K, thus leaving enough space for the scratchpad in the model to do its main work.

In practice, this may look as simple as directing a large payment through a pressure model before it reaches the main notification:

raw_payload = json.dumps(large_api_response) # about 15,000 tokens compressed_payload = compress_with_llmlingua(raw_payload, target_token_count=5000 ) prompt = f”Given this data: {compressed_payload}nUsers'n

raw_payload = json.dumping grounds(api_main_response) # about 15,000 tokens

compressed_payload = press_with_llmlingua(

raw_payload,

target_token_count=5000

)

immediately = f“Given this data: {compressed_payload}nnAnswer the user's question.”

Basic truths survive the journey without change; only their feet on the desk are wrinkled.

To summarize

Unlike compression, compression removes the original data and replaces it with a summary. It should be taken for what it is: a one-way journey that is irreversible in nature. A good, almost essential practice when using content summarization, therefore, is to use forked storage: dump the raw documents in cheap storage such as S3 buckets or basic SQL tables, and transfer the combined summary to the active notification.

That forked storage pattern can be easily expressed as a two-step write, one to cold storage and one to active information:

def summarize_turn(raw_transcript, session_id, turn_id): # 1. Continue raw, unsummarized transcripts to cold storage s3_client.put_object( Bucket=”agent-transcripts”, Key=f”{session_id}/turn_{turn_id}.json”, Genetrascript2 activrate quick summary = summarizer_model.generate(raw_transcript) # 3. Only summary re-enters shortcut to restore the context window

def shorten_turn(transcript_green, session_id, turn_id):

# 1. Continue with raw, uncondensed transcripts in a cold place

s3_client.put_item(

A bucket=“agent documents”,

The key=f“{session_id}/turn_{turn_id}.json”,

The body=green_the transcript

)

# 2. Make a comprehensive summary of the relevant information

summary = summary_model.produce(transcript_green)

# 3. Only the shortcut re-enters the context window

come back summary

If a later step requires the original data, it can always be retrieved from S3. Summarization, unlike compression, does not require reconstruction within the active information itself.

Memory Persistence as a State Machine

Memory persistence in agents is often taken for granted, especially by junior developers. But to give the agent real memory, it should not act as a database, but as a database manager. Let's say a user says, “My dog's name is Goofy, but we might rename him Pluto”. After that the agent should be able to transparently make a tool call like this:

{ “tool”: “update_entity_graph”, “params”: { “subject”: “User_Dog”, “attribute”: “Name”, “value”: “Goofy”, “notes”: “Think Pluto” } }

{

“tool”: “update_entity_graph”,

“parameters”: {

“title”: “User_Dog”,

“attribute”: “name”,

“value”: “Goofy”,

“notes”: “Thinking of Pluto”

}

It doesn't matter if it is supported by a standard SQL table, a data graph, or Redis: either way, the agent must be taught to query the state machine at the beginning of every opportunity, and commit to it at the end of that curve. As a loop, this query-then-commit directive looks like this:

def agent_turn(user_message, entity_graph): # The current state of the query at the START of each turn current_state = entity_graph.query(subject=”User_Dog”) response = model.generate( messages=[{“role”: “user”, “content”: user_message}]context=current_state ) # Do any updates at the END of all call possibilities in response.tool_calls: entity_graph.update(**call.params) return response

def agent_turn(user_message, business_graph):

# Ask the status quo at the start of every opportunity

current_state = business_graph.the question(the subject=“User_Dog”)

the answer = model.produce(

messages=[{“role”: “user”, “content”: user_message}],

context=at the moment_situation

)

# Make any updates at the END of all opportunities

for call in the middle the answer.tool_calls:

business_graph.review(**call.parameters)

come back the answer

Wrapping up

With these ideas, you should now have a clear picture of the elements that play a role in context management in agents built on language models. The lesson is simple: stop trying to buy a big, 10 million token desk. Instead, just get a regular desk, give your agent a sharp pencil, and teach him how to open a filing cabinet and make the most of its contents to do his job.

Source link

nimda 3 weeks ago

0 5 7 minutes read