Persistent Latent Memory for Multi-Hop LLM Agents: How a 6G Handover Paper Closes the Agent Cold-Start


A humorous-but-real tour of ILCP-for-agents — a β-VAE compressor, an Xn-style transport, a gated MLP projector, and the unreasonably convenient realization that I had already solved this exact problem for 6G handovers. The agent-side V1 is the wiring; the receipts in this post are the 6G paper’s receipts, properly labelled, because honest writing is the whole point of this series.

This is Part 4, the capstone, of the “Production-Grade Agentic Inference” series. Each part removed one kind of redundant work from an agentic LLM pipeline. Part 1 killed redundant prefill (don’t read the same document twice). Part 2 killed redundant waiting (don’t queue fifty agents single-file). Part 3 killed redundant CPU round-trips (don’t bounce every retrieval off the GPU). Part 4 (this post, and the final one) kills redundant context rebuilds: the agent equivalent of throwing away your hidden state every time the conversation hands off to a new specialist.

Key Takeaways

The problem: in a multi-hop agent pipeline, every time the control shifts from agent A to agent B, the receiver throws A’s hidden state away and rebuilds the context from a prompt string. That is structurally the same “post-handover cold start” a user equipment (UE) suffers when it moves between two base stations (from source to target), where the target base station re-initialises the per-user recurrent state from scratch.

The fix: compress the sender’s recurrent state into a tiny latent payload, transport it across the hand-off, and let the receiver use it as a soft-prompt prefix instead of re-prefilling everything from text. The same “compute once, fan out shared state” lesson the series has been hammering since Part 1, applied across reasoning hops instead of within one pipeline.

The ‘un’usual receipts: the underlying method is Inductive Latent Context Persistence (ILCP), a peer-reviewed paper I co-authored recently, accepted at AI4NextG @ ICML 2026. On the Vienna 4G/5G drive-test, ILCP eliminates ping-pong handovers entirely (0.0% vs 6.5% no-transfer baseline, 22.6% Transformer baseline), recovers post-handover accuracy at +5.1 pp average / +13.3 pp peak, and runs end to end at 7.7 ms p99 per handover decision on the same GTX 1080 as the rest of this series.

The honest part: those numbers are 6G radio handover numbers, not LLM-agent numbers. The agent-side V1 in this post (ilcp-for-agents) is the wiring — a β-VAE compressor, an in-process transport, a gated MLP projector, and a Qwen2.5-7B harness — and its agent-side benchmarks are explicitly future work. I am refusing to launder RAN receipts as LLM receipts even where the temptation is high.

The kicker: the telecom thread that ran through Parts 1–3 as analogy is, in Part 4, my published research solving the same problem in two different industries. The series closes the loop.

TL;DR: Multi-hop LLM agents currently hand off context as a string. Agent A finishes reasoning, summarises it into prompt text, and Agent B reads that string from scratch — the receiver’s KV cache, attention pattern, and any partial computation Agent A built up are all discarded. This is the agent version of the post-handover cold start that 5G/6G base stations suffer when a UE (mobile device) moves between two base stations: the target base station re-initialises the per-UE recurrent state and has to rebuild it from scratch. We solved that problem with a method called Inductive Latent Context Persistence (ILCP): a β-VAE compresses a 128-dim GRU state into a 128-byte latent payload, transports it over the standard 3GPP Xn interface, and then a gated MLP projects it into the target base station’s state space at handover. On the Vienna 4G/5G drive-test, ILCP eliminates ping-pong handovers (0.0% vs 6.5% no-transfer baseline), recovers post-handover next-cell accuracy by +5.1 pp average / +13.3 pp peak in the 50–250 ms post-handover window, and runs at 7.7 ms p99 per decision on a single GTX 1080. This part maps that exact same protocol onto LLM agent hand-offs: ilcp-for-agents learns to compress a pooled hidden summary, transport it across the hand-off, and project it back into a receiver-side soft-prompt prefix. V1 is the wiring (PyTorch, Qwen2.5-7B-Instruct, β-VAE, gated MLP, in-process transport, toy exact-match metric). The contribution here is the architectural transfer, not the numbers.

Github Repo:

(Quick confession before we start: I came at this whole series from a 5G/6G RAN engineering background. The telecom angles in Parts 1, 2, and 3 were the analogies. This part stops being an analogy. The mechanism I am mapping onto LLM agents is the same mechanism, written by the same co-authors, that closes the post-handover cold start in 6G radio access networks. That is the series capstone. There is a whole section on the side-by-side — section 7 — but it is also why this post exists in the shape it does.)

Architecture mental model — keep this open while you read.

Agent A context → masked-mean-pool → β-VAE encoder → z (32-dim latent) → in-process TransportPayload → β-VAE decoder + gated MLP → K memory tokens → torch.cat onto Agent B question embeds → greedy decode

Everything below is just commentary on one piece of that line.

Compress, transport and project

1. A confession: I solved this problem before I knew I had it

In Part 3, we went to slightly absurd lengths to keep our tensors exactly where they belong: on the silicon. By writing a custom CUDA kernel for Top-K retrieval, we killed the redundant CPU round-trips that drag down agentic RAG. The philosophy was absolute—once the GPU computes a rich, high-dimensional state, you do not move it, and you certainly do not destroy it. And yet, the moment that highly-optimized retriever finishes its job and passes the baton to the next specialist in your pipeline, standard frameworks force you to do exactly that. We guard our tensor state with our lives inside a single node, only to voluntarily throw it in the trash the second we cross a reasoning hop.

Let me dramatize the agent hand-off the way every multi-hop pipeline does it today.

You: “Agent A, read this 50-page report, create a summary and hand it off to Agent B for fact-checking.”

Agent A: “Sure. Loading model. Reading the report. Pooling the context. Building attention over paragraph 47. Forming my opinions. ✅”

GPU works for 30 seconds.

Agent A: “Done. Here is a 200-token summary I am very proud of.”

You: “Great. Forwarding to Agent B.”

Agent A: “Wait, how exactly are you forwarding?”

You: “…as a string? In the prompt?”

Agent A: “Right. So you are sending Agent B my final string. Not my hidden state. Not my calculated attention over the 50 pages I just read. Not the fact that paragraph 47 was unusually load-bearing. Not the calibrated confidence I built up. Just the string.”

You: “That is how it works, yes.”

Agent A: “Cool. Cool cool cool. Have fun, B. 👋”

Agent B: “Hello, I am a beautiful, stateless newborn. Loading model. Reading Agent A’s string from scratch. Building context. Pooling. Forming opinions. ✅”

GPU spends another 30 seconds doing essentially the same work Agent A just finished doing.

You: “…is there a way to skip the second context build-up and attention calculation?”

Agent B: “What second read?”

That, right there, is what every “multi-agent swarm” I have ever seen actually looks like under the lid. Each hand-off is a string-shaped throat that the sender’s internal state cannot squeeze through. The receiver gets the output text and rebuilds context from text — which is the most expensive thing a transformer usually does, and the thing this series has spent three parts trying to convince you to stop doing within one pipeline run.

The funny thing is that I have written about this problem before, just not for LLM agents.

In 2026 my co-author and I published a paper called “Inductive Latent Context Persistence: Closing the Post-Handover Cold Start in 6G Radio Access Networks.” The setting is a mobile phone (also called user equipment or UE) moving between 5G/6G base stations (also called gNBs). At every handover, the per-UE recurrent state held at the source gNB is discarded, and the target gNB re-initialises its per-UE hidden state from scratch. The target-side prediction model then has to rebuild that state from the few post-handover radio measurements it has just received, while the UE is already moving. The paper calls this the post-handover cold start. Ring any bells?

Now read this paragraph from the paper’s contributions, lightly de-jargoned: “We treat the per-user recurrent state as portable network context. To address the practical issue that the standard inter-cell message has a small size budget, we show that a 128-byte differential update is sufficient to preserve the predictive quality of a 128-dimensional GRU state across the handover boundary. Our proposed ILCP protocol compresses the hidden state with a β-variational autoencoder, transports it on the standard 3GPP Xn interface, and projects it onto the target gNB’s state space at the moment of handover via a learned, gated MLP.”

If you swap “source gNB” for “agent A,” “target gNB” for “agent B,” and “radio measurements” for “tokens of the next sub-task,” you have the architecture this whole post is about. Same paper, same authors; only the application domain is different.

The contribution of this post is not the method — the method is already in the paper. The contribution of this post is the mapping: taking ILCP and wiring it up for multi-hop LLM agents, end to end, in a small PyTorch repo which you already saw. The receipts you are about to see in section 5 are the paper’s receipts, in 6G handover units, properly labelled.

2. Why does the agent cold-start exist at all? (a one-minute crash course)

Skip this section if you have already wired up a multi-hop agentic pipeline. For everyone else, here is the short version.

A multi-hop agent is not one model answering one question. It is several specialised models taking turns at a job. A router decides intent, a planner decomposes the task, a retriever fetches grounded context, a reasoner does the actual thinking, a safety checker sniffs the output, and a finaliser writes the response. Each of these is its own forward pass, often its own model, sometimes its own process.

Between any two of those forward passes, control hands off. And here is the dirty secret that most agent frameworks paper over with friendly diagrams: at every hand-off, the receiver gets text. Not a hidden state, not a KV cache, not even a vector — text. The router’s intent classification turns into a token string. The planner’s task plan turns into a JSON blob. The retriever’s selected chunks turn into a concatenated context window. The reasoner’s chain of thought turns into a “thought:” block. Every hand-off is a tokeniser round-trip.

That sounds fine until you count. A standard agentic pipeline with four hops over a long shared context will tokenise and re-prefill the same source material four times. Three of those reads add nothing. The router already knew, the planner already knew, the reasoner already knew. The fourth read is doing it for the safety checker’s benefit, and the safety checker is going to do it again for the finaliser.
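To put a number on that counting, here is a back-of-the-envelope sketch. The function name and the 20,000-token context are illustrative assumptions, not measurements from the repo; the only thing it encodes is "every hop re-reads the full shared context".

```python
def redundant_prefill_tokens(context_tokens: int, hops: int) -> tuple[int, int]:
    """Return (total_prefilled, redundant) tokens when every hop re-reads the shared context."""
    if hops < 1:
        raise ValueError("hops must be >= 1")
    total = context_tokens * hops            # every hop prefills the full context again
    redundant = context_tokens * (hops - 1)  # only the first read is new work
    return total, redundant


total, redundant = redundant_prefill_tokens(context_tokens=20_000, hops=4)
print(f"prefilled {total} tokens, {redundant} of them redundant ({redundant / total:.0%})")
# prefilled 80000 tokens, 60000 of them redundant (75%)
```

With four hops, three of the four reads are repeats, which is where the 75% comes from.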

If that sounds like the SwarmKV problem from Part 1, it is — but rotated 90 degrees. Part 1 was about N agents reading the same document inside one pipeline run. This part is about N agents reading each other’s accumulated context across separate reasoning hops. Different shape, same underlying tax: the receiver is throwing away expensive computed state and rebuilding it from scratch from a prompt string. That is the post-handover cold start, dressed up in a Python wrapper.

3. The “just send a latent across the hand-off” lightbulb (and why it’s harder than it sounds)

The pitch is simple:

While Agent A is still alive, pool its working context into a single fixed-size summary vector $s_A$.
Compress $s_A$ with a β-VAE into a tiny latent $z$ of, say, 32 floats. That is 128 bytes at fp32.
Hand $z$ across the boundary as the only thing that crosses the hop.
At Agent B, decode $z$ and project it through a gated MLP into K memory vectors in B’s own embedding space.
Concatenate those K vectors in front of the question token embeddings and let B greedy-decode. B never re-reads A’s context as text.
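The five steps above can be traced as a shape walk-through in plain Python. This is a toy sketch, not the repo's code: the weights are random and untrained, a single linear map stands in for the β-VAE encoder, the gate is omitted, and the sizes (`HIDDEN`, `LATENT`, `K`) are small placeholders chosen for readability.

```python
import random

random.seed(0)  # fixed seed so the toy run is repeatable

HIDDEN, LATENT, K = 8, 4, 2  # toy sizes, not the repo's values


def rand_matrix(rows: int, cols: int) -> list[list[float]]:
    """Untrained random weights; a trained model would learn these."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(x: list[float], w: list[list[float]]) -> list[float]:
    """Matrix-vector product: w has one row per output dimension."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]


# 1. Pool: average Agent A's per-token hidden vectors into one summary s_A.
token_states = [[random.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(5)]
s_a = [sum(column) / len(token_states) for column in zip(*token_states)]

# 2. Compress: a linear map stands in for the β-VAE encoder mean.
z = linear(s_a, rand_matrix(LATENT, HIDDEN))

# 3. Transport: z is the only thing that crosses the hop (4 bytes per value at fp32).
payload_bytes = 4 * len(z)

# 4. Project: lift z into K vectors in the receiver's embedding space.
flat = linear(z, rand_matrix(K * HIDDEN, LATENT))
memory = [flat[i * HIDDEN:(i + 1) * HIDDEN] for i in range(K)]

# 5. Prepend: memory vectors go in front of the question token embeddings.
question_embeds = [[0.0] * HIDDEN for _ in range(3)]  # placeholder for 3 question tokens
prefix = memory + question_embeds

print(f"{payload_bytes} bytes crossed the hop; prefix has {len(prefix)} positions of width {HIDDEN}")
# 16 bytes crossed the hop; prefix has 5 positions of width 8
```

The point of the trace is the contract: a fixed-size vector goes in, a fixed number of bytes crosses, and a fixed number of extra positions shows up in front of the question.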

That is “compute once, hand off the compressed state” — the same lesson as Parts 1–3, applied across hops instead of within one pipeline. The only reason it takes more than a 30-line PyTorch script is that three boring problems immediately break the naive approach.

Problem A: What do you actually carry across the hand-off?

The obvious answer is “the whole KV cache.” Agent A built it; just hand it to Agent B. The catch is that the KV cache is a per-context object that depends on the model, the tokenizer, the quantization, the RoPE configuration, the attention implementation, the layer count, the head count, the GQA ratio, the n_ctx, and the exact GGUF / safetensors hash. Hand a KV cache from a process running model X to a process running model Y and you have shipped a binary blob the receiver cannot interpret. The on-disk KV roadmap item in SwarmKV’s V1 limitations exists precisely to make those invariants explicit; until then, every “share the KV cache between agents” idea has to negotiate a vocabulary of seven matching fields before the bytes mean anything.
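As a rough illustration of what "negotiating matching fields" could look like, here is a hypothetical compatibility check. The class name, the field names, and the example values are all assumptions picked from the list above for illustration; this is not SwarmKV's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class KVCacheFingerprint:
    """Hypothetical invariants two processes must share before a KV cache is reusable."""
    model_hash: str
    tokenizer: str
    quantization: str
    rope_config: str
    num_layers: int
    num_kv_heads: int
    n_ctx: int


def mismatched_fields(sender: KVCacheFingerprint, receiver: KVCacheFingerprint) -> list[str]:
    """Return the name of every field that differs; an empty list means the bytes are interpretable."""
    return [f.name for f in fields(sender) if getattr(sender, f.name) != getattr(receiver, f.name)]


agent_a = KVCacheFingerprint("sha256:aaa", "qwen2", "q4_k_m", "theta=1e6", 28, 4, 8192)
agent_b = KVCacheFingerprint("sha256:aaa", "qwen2", "q8_0", "theta=1e6", 28, 4, 4096)
print(mismatched_fields(agent_a, agent_b))  # ['quantization', 'n_ctx']
```

A single mismatch is enough to make the cache unusable on the receiver, which is why the check returns the full list rather than a boolean: the log line tells you what to fix.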

The ILCP V1 makes the boring choice on purpose: don’t carry the KV cache, carry a learned summary of what the KV cache was trying to represent. A single pooled hidden-state vector is model-version-fragile but not catastrophically so — it is just (hidden_size,) floats, and if Agent A and Agent B share the same base model it is unambiguous what those floats mean. From src/agents/qwen_encoder.py:

@staticmethod
def masked_mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """
    Approved V1 pooling: average final-layer token vectors with padding masked to zero mass.

    Dividing by the raw token count (not L2-normalizing) preserves magnitude cues about confidence
    and saturation that a unit-norm pool would erase before the VAE bottleneck.
    """
    if last_hidden.dim() != 3:
        raise ValueError("last_hidden must be (batch, seq, dim).")
    if attention_mask.dim() != 2:
        raise ValueError("attention_mask must be (batch, seq).")
    mask = attention_mask.unsqueeze(-1).to(dtype=last_hidden.dtype, device=last_hidden.device)
    summed = (last_hidden * mask).sum(dim=1)
    lengths = mask.sum(dim=1).clamp(min=1.0)
    return summed / lengths

We use a masked mean over the final-layer hidden states. The docstring is honest about the design choice: we do not L2-normalise on the way out, because magnitude carries useful signal about confidence and saturation that a unit-norm pool would erase. The downstream β-VAE bottleneck is the thing that decides what to keep and what to drop, not the pooling layer.
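A tiny worked example makes both properties visible: padding is ignored, and magnitude survives the pool. This is a plain-Python re-statement for illustration, runnable without PyTorch; the `l2_normalise` helper is my own addition to show what a unit-norm pool would throw away.

```python
import math


def masked_mean_pool(hidden: list[list[float]], mask: list[int]) -> list[float]:
    """Average token vectors where mask == 1; padded positions contribute nothing."""
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same sequence length")
    dim = len(hidden[0])
    count = max(sum(mask), 1)  # mirrors clamp(min=1.0): never divide by zero
    return [sum(vec[d] * m for vec, m in zip(hidden, mask)) / count for d in range(dim)]


def l2_normalise(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (what the V1 pool deliberately does NOT do)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


hidden = [[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]]  # third row is padding garbage
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)
louder = masked_mean_pool([[2 * x for x in row] for row in hidden], mask)
print(pooled)  # [3.0, 6.0]
print(louder)  # [6.0, 12.0]
```

The two pooled vectors differ by a factor of two, so the downstream compressor can see the difference. After `l2_normalise` they would point in the same direction with the same length, and that cue would be gone.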

Problem B: How small can the payload get and still be useful?

A pooled hidden vector from Qwen2.5-7B is (3584,) floats: fourteen kilobytes at fp32. That is fine for in-process toy demos, but the moment the transport becomes a real network message, fourteen kilobytes per hand-off is not free. The radio paper’s whole argument is that a 128-byte differential update is sufficient to preserve the predictive quality of a 128-dimensional GRU state, because the GRU state lives on a low-dimensional task-relevant manifold and a β-VAE can find that manifold during joint training with the downstream loss.

The agent-side V1 makes the same architectural choice. From src/compressor/beta_vae.py:

class BetaVAE(nn.Module):
    """
    Fully-connected β-VAE operating on pooled LM hidden states.

    Why fully-connected instead of convolutions: the input is already a single vector per sample
    (masked mean pool over time), so conv layers would add parameter overhead without exploiting
    spatial locality the way a CUDA kernel would on grids.
    """

    def __init__(
        self,
        input_dim: int,
        latent_dim: int,
        hidden_dim: int,
        beta: float = 1.0,
    ) -> None:

The two encoder heads return μ and logvar separately, because tying them would constrain curvature; the reparameterisation trick keeps sampling differentiable; the closed-form KL divergence is the standard analytic expression for a diagonal Gaussian encoder against a standard normal prior. None of this is new VAE work: Kingma & Welling (2014) and Higgins et al. (2017) did the heavy lifting a decade ago, and the comments in the code are explicit that none of it pretends to be. The contribution is what you attach the VAE to, not the VAE itself.
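For readers who have not seen those two pieces written out, here is a minimal plain-Python sketch of the textbook formulas. The function names are mine, not the repo's, and the squared-error reconstruction term is left as a number you pass in.

```python
import math
import random


def kl_to_standard_normal(mu: list[float], logvar: list[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def reparameterise(mu: list[float], logvar: list[float], rng: random.Random) -> list[float]:
    """z = mu + sigma * eps with eps ~ N(0, 1); all randomness lives in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def beta_vae_loss(recon_error: float, mu: list[float], logvar: list[float], beta: float = 1.0) -> float:
    """Reconstruction term plus beta-weighted KL; beta > 1 pushes harder on the bottleneck."""
    return recon_error + beta * kl_to_standard_normal(mu, logvar)


mu, logvar = [1.0, -0.5], [0.0, -2.0]
rng = random.Random(0)
sample_a = reparameterise(mu, logvar, rng)
sample_b = reparameterise(mu, logvar, rng)

print(sample_a != sample_b)                   # two draws from the same (mu, logvar) differ
print(kl_to_standard_normal([1.0], [0.0]))    # 0.5
print(beta_vae_loss(2.0, [1.0], [0.0], 4.0))  # 4.0
```

Because the noise enters only through `eps`, gradients can flow through `mu` and `logvar` during training. At inference you can skip the draw entirely and use `mu`, which makes the payload deterministic.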

There is a small helper in the same file that exists purely for honest reporting:

def latent_payload_bytes(latent_dim: int, dtype: torch.dtype) -> int:
    """
    Report transferable payload size in bytes for README receipts (not assumed 128-byte telecom payload).

    Element size follows torch.dtype element alignment; this is the on-wire analog for in-process transport.
    """
    # torch.finfo / element_size gives byte width for floating dtypes used in z tensors.
    if not dtype.is_floating_point:
        raise ValueError("latent_payload_bytes expects a floating dtype for z.")
    width = torch.tensor([], dtype=dtype).element_size()
    return int(latent_dim) * int(width)

The docstring says “not assumed 128-byte telecom payload.” The agent-side code refuses to assume the 6G paper’s 128-byte number — it computes the agent-side payload from the actual latent dim and dtype on the run that you ran, so the README receipt for the agent side will tell its own truth. That single helper is the entire “do not launder RAN numbers as LLM numbers” policy, in seven lines.

Problem C: How does the receiver actually use the latent?

Here is where the radio paper and the LLM mapping diverge a little, because the receiver is not the same shape of object in the two domains. In the radio paper, the receiver is a target base station running its own GRU + heterogeneous graph transformer, and the projection block lands the decoded latent back into the target’s recurrent state space. In the agent-side V1, the receiver is a frozen Qwen2.5-7B-Instruct decoder, and the projection block has to land the decoded latent into something the decoder can attend over.

The cleanest thing a frozen decoder can attend over is its own token embedding space. So the projector lifts the latent into K memory vectors that live in the same vector space as real token embeddings, and the receiver concatenates them in front of the question. From src/projector/gated_mlp.py:

class GatedLatentToMemoryProjector(nn.Module):
    """
    Map z ∈ R^{latent_dim} to memory ∈ R^{K × model_hidden}.

    The gate uses a SiLU nonlinearity (smooth ReLU) so gradients do not die on negative pre-activations
    the way they would with a plain sigmoid gate saturating at initialization.
    """

The trunk is a small MLP; the gate head and the value head split off it; the gate is sigmoided, the value is left raw, and an elementwise product squashes out spurious latent directions before the final reshape into (batch, K, model_hidden). Two paths, one product, one reshape. The comment about applying the gate in fp32 even when the LM runs in bf16 is a Pascal-era footnote: the GTX 1080 used as the canonical reference for this series does not have a full-speed bf16 ALU, so gating in fp32 is just polite to the hardware.
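To see what "two paths, one product, one reshape" does numerically, here is a plain-Python toy. It takes the gate logits and raw values as given (in the real projector they come from learned heads on top of the trunk), so the function signature and the numbers are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory(
    gate_logits: list[float],
    values: list[float],
    num_memory_tokens: int,
    model_hidden: int,
) -> list[list[float]]:
    """Apply a sigmoid gate to raw values, then reshape the flat result to (K, D)."""
    expected = num_memory_tokens * model_hidden
    if len(gate_logits) != expected or len(values) != expected:
        raise ValueError("gate_logits and values must both have K * D entries")
    flat = [sigmoid(g) * v for g, v in zip(gate_logits, values)]
    return [flat[i * model_hidden:(i + 1) * model_hidden] for i in range(num_memory_tokens)]


memory = gated_memory(
    gate_logits=[20.0, -20.0, 0.0, 0.0],  # wide open, shut, half open, half open
    values=[1.0, 1.0, 2.0, -2.0],
    num_memory_tokens=2,
    model_hidden=2,
)
print([[round(x, 3) for x in row] for row in memory])  # [[1.0, 0.0], [1.0, -1.0]]
```

The first two entries carry the same raw value, but the gate passes one and suppresses the other. That is the "squash out spurious directions" behaviour in miniature.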

The exact concatenation happens inside the harness. From src/agents/harness.py:

prefix = torch.cat([memory_tokens.unsqueeze(0), q_embeds], dim=1)

That is the entire act. Memory tokens go in front; question token embeds follow. The receiver decodes from this prefix using inputs_embeds instead of input_ids, and the rest of the generation loop is a standard greedy decode that reuses KV cache entries the way any production decoder would after the first wide forward pass.

These three problems — what to carry, how small to make it, how to inject it on the other side — are the entire substance of “send a latent across the hand-off.” Section 4 walks through how the pieces glue together end to end.

4. The compress → transport → project pipeline (the actually-cool part)

The whole agent-side pipeline is six steps, with the wiring living in four files. Picture it as a horizontal pipe.

Step 0:  Construct compressor + projector + transport at engine init   (load_ilcp_modules)
Step 1:  Encode Agent A's context to a pooled summary s_A              (QwenContextEncoder.pooled_embedding_for_text)
Step 2:  Compress s_A to a latent z via the β-VAE encoder mean μ       (BetaVAE.encode)
Step 3:  Pack z into a TransportPayload (CPU staged bytes)             (InProcessTransport.pack)
Step 4:  Unpack on the receiver and project z to K memory tokens       (InProcessTransport.unpack + GatedLatentToMemoryProjector)
Step 5:  Concat memory in front of question embeds and greedy-decode   (greedy_generate_from_memory_prefix)

Let’s walk each one with the real code. Snippets are deliberately short; the full files are tiny and worth looking into.

Step 1 — Pool Agent A’s context

This is the boring step that makes the rest of the pipeline possible. Agent A reads its working context with the same Qwen2.5-7B-Instruct that Agent B will run, but instead of decoding tokens out, we ask the model to give us the final-layer hidden states and pool them into a single vector. From src/agents/qwen_encoder.py:

@torch.inference_mode()
def encode_contexts(
    self,
    texts: list[str],
    max_length: int = 512,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Return pooled embeddings (batch, hidden) and the attention masks used for auditing shapes.

    torch.inference_mode() disables version counter bookkeeping entirely vs no_grad for slightly lower overhead
    when sweeping thousands of contexts for compressor dataset construction on a budget GPU.
    """
    batch = self.tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    batch = {k: v.to(self.device) for k, v in batch.items()}
    outputs = self.lm(**batch, output_hidden_states=True, use_cache=False)
    last_hidden = outputs.hidden_states[-1]
    pooled = self.masked_mean_pool(last_hidden, batch["attention_mask"])
    return pooled, batch["attention_mask"]

One forward pass with output_hidden_states=True, take the last layer’s hidden states, apply the masked mean pool from section 3, done. The pooled tensor is (batch, hidden_size), which for Qwen2.5-7B is (1, 3584) per sample. That is the s_A vector the paper’s section 4 talks about, except that for a phone moving between cell towers the same vector is built by a 128-dim GRU over radio measurements. Different sensors, same role.

Step 2 — Compress to a latent z

From src/agents/harness.py:

@torch.inference_mode()
def ilcp_memory_from_context(self, context: str, device: torch.device) -> torch.Tensor:
    """
    Compress→transport→project pipeline returning receiver memory tensor (K, D) on `device`.

    Using the VAE encoder mean μ (not a stochastic sample) stabilizes multi-trial latency and quality
    comparisons the way a deployed system would freeze stochasticity after calibration.
    """
    s_a = self.encode_sender_summary(context).to(device)
    mu, _logvar = self.vae.encode(s_a.unsqueeze(0))
    z = mu.squeeze(0)
    payload = self.transport.pack(z)
    z_b = self.transport.unpack(payload, device=device)
    mem = self.projector(z_b.unsqueeze(0)).squeeze(0)
    return mem

That is the whole compress→transport→project hop. The docstring makes a quiet but important production choice explicit: we use the encoder mean μ at inference, not a stochastic reparameterised sample. The radio paper does the same thing in deployment; the stochasticity is only useful during training. Freezing it after calibration is what every shipped VAE-bottleneck system does, and the comment says so.

Inside the β-VAE, encode returns mean and logvar; we drop logvar because we are not sampling here. From src/compressor/beta_vae.py:

def encode(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Map a batch of summary embeddings to diagonal-Gaussian parameters.

    Returning logvar instead of std avoids a sqrt during training and improves numerical stability
    when variances become tiny (avoids division blow-ups in KL closed form).
    """
    # Flatten optional middle dimensions so the encoder always sees (batch, input_dim).
    h = self._encoder_body(x)
    mu = self._enc_mu(h)
    logvar = self._enc_logvar(h)
    return mu, logvar

Standard β-VAE encoder. The two separate heads for μ and logvar exist because tying them would constrain the curvature of the latent geometry, which the comment lays out in one sentence so the next reader does not have to wonder.

Step 3 — Pack into a TransportPayload

The transport boundary is deliberately explicit. Even though Agent A and Agent B share the same Python process in V1, the hand-off is wrapped in a serialise/deserialise step so that future versions can swap in a real network — gRPC, shared-memory ring buffers, a wire protocol — without rewriting the call sites. From src/transport/in_process.py:

@dataclass(frozen=True)
class TransportPayload:
    """
    Immutable container for the latent tensor plus minimal metadata for audit logs.

    Freezing the dataclass prevents accidental in-place mutation that would desync byte-size receipts
    between the sending and receiving agent in multi-threaded harnesses.
    """

    latent: torch.Tensor
    dtype_name: str
    latent_dim: int

    def byte_length(self) -> int:
        """
        Compute exact serialized byte length for README tables (torch.save uses pickle; we use raw bytes).

        Raw contiguous bytes mirror the "payload over Xn" story more honestly than pickling full tensors.
        """
        return int(self.latent.numel() * self.latent.element_size())

byte_length() is the agent-side analog of the radio paper’s “128 B payload over Xn.” It computes the actual transferable byte count from the actual tensor, no spec-sheet assumptions. The radio paper says 128 bytes because the 3GPP Xn HANDOVER REQUEST has a strict optional-IE size budget. The agent side does not have that constraint yet, but it is wired to measure whatever size it lands on, so when it does get a real network transport, the receipt will be honest.
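The "raw bytes versus pickle" distinction is easy to check with the standard library alone. This sketch uses `struct` as a stand-in for a wire format; it is not what the repo's transport does internally, and the 32-value latent is just the example size used earlier in the post.

```python
import pickle
import struct

latent = [0.1 * i for i in range(32)]  # stand-in for a 32-dim latent

# Raw little-endian float32: exactly 4 bytes per value, no framing or metadata.
raw = struct.pack("<32f", *latent)
restored = list(struct.unpack("<32f", raw))

# Pickle carries opcodes and stores Python floats at 8 bytes each.
pickled = pickle.dumps(latent)

print(len(raw))                 # 128
print(len(pickled) > len(raw))  # True
print(max(abs(a - b) for a, b in zip(latent, restored)) < 1e-6)  # True
```

The round trip is lossy only at float32 precision, and the byte count is the number you would actually put in a receipt.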

The pack step itself is small and deliberate:

def pack(self, z: torch.Tensor) -> TransportPayload:
    """
    Detach from autograd, move to staging device, and record dtype for round-trip fidelity.

    Detaching breaks the graph on purpose: transport is a hand-off boundary where sender gradients stop.
    """
    z_staged = z.detach().to(self.device).contiguous()
    return TransportPayload(
        latent=z_staged,
        dtype_name=str(z_staged.dtype),
        latent_dim=int(z_staged.shape[-1]),
    )

detach() breaks the autograd graph on purpose. The hand-off boundary is a logical boundary, not just a memory move. Sender gradients stop at this line; the receiver builds its own graph from the unpacked latent if it wants to. The radio paper imposes the same boundary at the Xn interface, for the same reason — the source and target base stations are different processes on different machines, and a cross-process autograd graph is a bad idea regardless of industry.

Step 4 — Unpack and project to K memory tokens

def unpack(self, payload: TransportPayload, device: torch.device | str) -> torch.Tensor:
    """
    Materialize z on the receiver device before the projector lifts it into memory embeddings.

    clone() prevents aliasing if the same payload object were accidentally reused across two agents.
    """
    return payload.latent.clone().to(device, non_blocking=True)

The receiver materialises z on its own device, clone()s defensively so two receivers reading the same payload object cannot stomp on each other, then hands it to the gated MLP. The gated MLP’s forward, from src/projector/gated_mlp.py:

def forward(self, z: torch.Tensor) -> torch.Tensor:
    """
    Return shape (batch, K, D) suitable for torch.cat along the sequence dimension with token embeds.

    Applying the gate in float32 even when the LM runs in bf16 can reduce numerical noise on consumer GPUs
    that lack full-speed bf16 ALUs (Pascal-era hardware note for GTX 1080 baselines).
    """
    # Ensure z is rank-2 so batch matmul paths stay vectorized on wide tensor cores when available.
    if z.dim() != 2:
        raise ValueError("GatedLatentToMemoryProjector expects z shaped (batch, latent_dim).")
    h = self._trunk(z)
    gate = torch.sigmoid(self._gate_head(h))
    value = self._value_head(h)
    # Elementwise product suppresses spurious directions before reshaping into memory tokens.
    mem_flat = gate * value
    return mem_flat.view(z.size(0), self.num_memory_tokens, self.model_hidden)

Trunk → gate head and value head split → elementwise product → reshape to (batch, K, D). The output D is the LM’s hidden size, so each of the K memory tokens already lives in the model’s embedding space and can be concatenated alongside real token embeddings without a separate space-matching layer.

Step 5 — Concat in front of the question and greedy-decode

Now the receiver actually answers. From src/agents/harness.py:

@torch.inference_mode()
def greedy_generate_from_memory_prefix(
    lm: nn.Module,
    tokenizer,
    embed_layer: nn.Module,
    memory_tokens: torch.Tensor,
    question_prompt: str,
    max_new_tokens: int,
    device: torch.device,
) -> str:
    """
    Greedy decoding starting from concatenated soft prompts + question token embeddings.

    The loop alternates between a wide prefix forward (first step) and skinny single-token steps that
    reuse KV cache entries the way a production decoder would after a hand-off frame arrives.
    """
    q_batch = tokenizer(question_prompt, return_tensors="pt", truncation=True, max_length=512)
    q_ids = q_batch["input_ids"].to(device)
    q_embeds = embed_layer(q_ids)
    prefix = torch.cat([memory_tokens.unsqueeze(0), q_embeds], dim=1)
    attention_mask = torch.ones(prefix.shape[:2], device=device, dtype=torch.long)
    past = None
    embed_step = prefix
    generated: list[int] = []
    for _ in range(max_new_tokens):
        out = lm(
            inputs_embeds=embed_step,
            attention_mask=attention_mask,
            past_key_values=past,
            use_cache=True,
            return_dict=True,
        )
        past = out.past_key_values
        logits = out.logits[:, -1, :]
        next_id = int(logits.argmax(dim=-1).item())
        generated.append(next_id)
        next_emb = embed_layer(torch.tensor([[next_id]], device=device, dtype=torch.long))
        embed_step = next_emb
        add = torch.ones((1, 1), device=device, dtype=torch.long)
        attention_mask = torch.cat([attention_mask, add], dim=1)
    return tokenizer.decode(generated, skip_special_tokens=True)

One wide prefix forward over [memory_tokens, question_embeds], then standard one-token-at-a-time greedy decode reusing past_key_values. The receiver never tokenises Agent A’s original context. It does not have to. The information that would have come from reading that context is already present in the K memory tokens, projected directly into the receiver’s embedding space.
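The "one wide step, then skinny steps" pattern is easier to see with the model swapped for a toy. In this sketch `toy_lm`, `embed`, and the 4-token vocabulary are stand-ins I made up so the loop runs without a GPU, and the optional `eos_id` early stop is an addition that the harness excerpt above does not have.

```python
VOCAB = 4  # toy vocabulary; embeddings are one-hot vectors of this width


def embed(token_id: int) -> list[float]:
    """Toy embedding layer: one-hot vector for the token id."""
    return [1.0 if i == token_id else 0.0 for i in range(VOCAB)]


def toy_lm(step_embeds, past):
    """Stand-in for an LM forward pass.

    Appends the new positions to the cache and scores the next token from the
    last position only: it predicts (strongest coordinate + 1) mod VOCAB.
    """
    cache = (past or []) + step_embeds
    last = step_embeds[-1]
    top = max(range(VOCAB), key=last.__getitem__)
    logits = [1.0 if i == (top + 1) % VOCAB else 0.0 for i in range(VOCAB)]
    return logits, cache


def greedy_from_prefix(memory, question_ids, max_new_tokens, eos_id=None):
    """Greedy decode from [memory vectors, question embeddings]; returns (token ids, step widths)."""
    step_embeds = memory + [embed(t) for t in question_ids]  # wide first step
    past, generated, widths = None, [], []
    for _ in range(max_new_tokens):
        widths.append(len(step_embeds))
        logits, past = toy_lm(step_embeds, past)
        next_id = max(range(VOCAB), key=logits.__getitem__)
        if next_id == eos_id:  # optional early stop, an addition in this sketch
            break
        generated.append(next_id)
        step_embeds = [embed(next_id)]  # skinny step: one new position, cache holds the rest
    return generated, widths


memory = [[0.5, 0.5, 0.0, 0.0]]  # one soft memory vector; it need not match any real token
tokens, widths = greedy_from_prefix(memory, question_ids=[0, 1], max_new_tokens=4)
print(tokens)  # [2, 3, 0, 1]
print(widths)  # [3, 1, 1, 1]
```

The `widths` list is the whole story: the first forward sees K + question positions, and every later forward sees exactly one.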

For comparison, the cold baseline that V1 measures against is the standard agent flow — the receiver gets the full context plus the question as input_ids and answers from text:

def _format_agent_prompt(context: str, question: str) -> str:
    """
    Build the cold-start string that forces the receiver LM to re-read the entire Agent A context.

    Keeping the instruction delimiter style stable across branches isolates the ILCP effect from prompt drift.
    """
    return (
        "You are Agent B. Read the context carefully, then answer the question with a short span.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    )

Cold path: read the whole passage, do the whole prefill, answer. ILCP path: get a latent, project it, decode from the prefix. Same task, two contracts, one of them does the second read and one of them does not. That is the V1.
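To put rough numbers on the two contracts, here is a minimal sketch that counts how many positions the receiver has to prefill on each path. The whitespace split is a crude stand-in for a real tokenizer, the prompt-template overhead is ignored, and k_memory_tokens=16 is an illustrative value, not a setting taken from the repository.

```python
def approx_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def prefill_positions(context: str, question: str, k_memory_tokens: int) -> dict[str, int]:
    """Compare how many positions the receiver prefills on each path."""
    q_len = approx_token_count(question)
    return {
        # Cold path: the receiver re-reads Agent A's context and the question.
        "cold": approx_token_count(context) + q_len,
        # ILCP path: K projected memory tokens stand in for the context.
        "ilcp": k_memory_tokens + q_len,
    }


# Hypothetical inputs: a 300-word context and a 5-word question.
context = "The Xn interface connects two gNBs. " * 50
question = "Which interface connects two gNBs?"
counts = prefill_positions(context, question, k_memory_tokens=16)
print(counts)
```

The cold count grows with the length of Agent A's context, while the ILCP count stays at K plus the question length regardless of how much Agent A read.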

5. The receipts (from the telecom paper, NOT from LLM agents)

This is the section where every previous part of this series put a benchmark table. Part 4 is the part where I have to be honest about which receipts I am actually allowed to put in front of you.

Quick note on methodology before anyone reaches for the rocks: every number in this section is from “Inductive Latent Context Persistence: Closing the Post-Handover Cold Start in 6G Radio Access Networks” (Banerjee & Awan, Nokia Munich, accepted at AI4NextG @ ICML 2026; preprint arXiv:2605.00593v2, June 2026). The paper evaluates ILCP on the Vienna 4G/5G drive-test, a multi-cell, multi-tier urban radio trace with dense cell overlap, 31 handover events in the held-out test split, per-step measurements at 100 Hz, with 1000-bootstrap 95% confidence intervals on every reported value. The inference hardware is one NVIDIA GTX 1080 (8 GB), Intel i7-8700K, 16 GB RAM — which is, conveniently, the same canonical reference GPU as the other parts of this series.

Method	Acc@t=0 (%)	HOF (%)	Ping-pong (%)	Ovr
ZK-HGT (no-transfer baseline)	87.1 (74–97)	12.9 (3–26)	6.5 (0–16)	75.6
GAT-Temporal	22.6 (10–39)	77.4 (61–90)	61.3 (45–77)	19.3
Transformer-Temporal	77.4 (61–90)	22.6 (10–39)	22.6 (10–39)	66.9
LSTM	12.9 (0–26)	87.1 (74–97)	83.9 (71–94)	6.3
3GPP A3/A5 (rule)	100.0 (100–100)	0.0 (0–0)	3.2 (0–10)	72.4
ILCP (ours)	83.9 (71–94)	16.1 (3–32)	0.0 (0–0)	74.1

These are 6G handover metrics, in 6G handover units, on a 6G handover trace. Acc@t=0 is “did the model pick the correct next serving cell at the moment of handover?” HOF is the fraction of handover events where the predicted next cell is incorrect. Ping-pong is the fraction of handovers reversed within a short window — operationally the most painful failure mode, because every reversed handover is wasted control-plane signalling on a network that already has a thousand other things to do. The ZK-HGT baseline (Zero-Knowledge HGT) is the otherwise identical architecture without ILCP — same heterogeneous graph backbone, same GRU, same candidate scorer — but with the per-user recurrent state re-initialised at every handover. ZK-HGT is the clean ablation that isolates the effect of cross-handover state persistence; ILCP is ZK-HGT plus a transferred latent.
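The bracketed ranges in the table are the 1000-bootstrap 95% confidence intervals mentioned above. As a rough illustration of how such an interval can be produced from 31 binary outcomes, here is a percentile-bootstrap sketch. The percentile variant, the seed, and the reconstruction of the outcome list from the reported percentage are assumptions on my part; the post does not say which bootstrap variant the paper uses.

```python
import random


def bootstrap_ci(outcomes: list[int], n_resamples: int = 1000, seed: int = 0) -> tuple[float, float, float]:
    """Return (point estimate, lower, upper) in percent for a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    point = 100.0 * sum(outcomes) / n
    # Resample the events with replacement and recompute the rate each time.
    rates = sorted(
        100.0 * sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Percentile interval: 2.5th and 97.5th percentiles of the resampled rates.
    lower = rates[int(0.025 * n_resamples)]
    upper = rates[int(0.975 * n_resamples) - 1]
    return point, lower, upper


# 27 correct out of 31 events matches the 87.1% point estimate in the ZK-HGT row.
correct = [1] * 27 + [0] * 4
point, lower, upper = bootstrap_ci(correct)
print(f"{point:.1f} ({lower:.0f}-{upper:.0f})")
```

With only 31 events each handover is worth about 3.2 percentage points, which is why the intervals are wide. A metric that never fires in the test split, such as ILCP's ping-pong rate, resamples to zero every time, hence the (0–0) interval.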

The single most operationally important row is the 0.0% ping-pong rate for ILCP versus 6.5% for ZK-HGT and 22.6% for the Transformer baseline. In dense future deployments with overlapping small cells, ping-pongs are exactly what destroys mobility quality of service. The paper’s section 5.1 calls out that the A3/A5 rule reaches 100% accuracy on the clean trace only because the handover labels in the trace were themselves generated by an A3/A5-like rule, so that comparison is a sanity check, not a real win. Read the paper for the careful discussion of why the unperturbed A3/A5 row should be treated as a label-leakage artefact and not a number to chase.
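For readers who want the ping-pong definition in executable form, here is a small sketch that counts handovers reversed within a time window. The event format and the one-second window are assumptions for illustration; the post only says "a short window", and the paper's exact threshold is not quoted here.

```python
def ping_pong_rate(handovers: list[tuple[float, str, str]], window_s: float = 1.0) -> float:
    """Percent of handovers that are reversed within window_s seconds.

    Each handover is (timestamp_s, source_cell, target_cell), sorted by time.
    """
    if not handovers:
        return 0.0
    reversed_count = 0
    for (t0, src, dst), (t1, next_src, next_dst) in zip(handovers, handovers[1:]):
        # A ping-pong is A -> B followed quickly by B -> A.
        if next_src == dst and next_dst == src and (t1 - t0) <= window_s:
            reversed_count += 1
    return 100.0 * reversed_count / len(handovers)


# Hypothetical trace: the first handover (A -> B) is undone 0.4 s later.
events = [(0.0, "A", "B"), (0.4, "B", "A"), (5.0, "A", "C"), (9.0, "C", "D")]
print(ping_pong_rate(events))  # 25.0
```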

6. “OK, but how is this different from prefix caching / RadixAttention / RAG memory?”

Reasonable question, and worth answering directly, because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.

vLLM prefix caching / TGI session caches. Excellent if your shared prefix is request-scoped or session-scoped. They cache the KV state inside one serving runtime so a follow-up request from the same session does not re-prefill the same prefix. They do not survive across a hand-off where Agent B is a different model, a different process, a different machine, or even just a different llama_context — the KV blob is tied to the local context. ILCP-for-agents is explicitly cross-process portable by construction, because the transported object is a learned summary in a portable latent space, not a KV blob in the engine’s private memory layout.
SwarmKV (Part 1 of this series). Closest cousin in the same author’s body of work, with a critical difference: SwarmKV fans the same KV cache out to N branches that all share one document inside one pipeline run. ILCP-for-agents goes the other direction — it persists state across hops where Agent B’s job is different from Agent A’s. SwarmKV is “compute once, fan out within a run.” ILCP-for-agents is “compress once, transfer across hops.” Together they cover both axes of the redundant-recomputation problem.
SGLang RadixAttention. Tree-shaped prefix sharing inside a serving runtime — beautiful for many requests with shared prefixes, again scoped to the runtime. Not designed to hand a portable, model-version-tolerant summary to a different process running a different specialised agent.
Retrieval-augmented generation (RAG) memory. Stores chunks of text in a vector DB and retrieves them at query time. Useful, but the unit of transfer is text — which means the receiver still has to tokenize and prefill. ILCP transfers a learned latent that the receiver consumes via inputs_embeds, skipping the tokeniser and the text-side prefill of the persisted content entirely.

One-line intuition: prefix caching is a serving-runtime trick for one user’s repeating prompt; RAG is a text database that the model still has to read every time; SwarmKV is a within-run KV fan-out; ILCP is the cross-hop architectural primitive that all of those are not, lifted from a peer-reviewed 6G paper because that industry happened to need it sooner. Different problems, complementary primitives, frequently co-deployable in the same building.
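To give a feel for the size of the KV blob those runtime-scoped caches hold, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) and the 1000-token context are hypothetical values chosen for illustration, not the configuration of any model used in this series.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,
) -> int:
    """Size of a dense KV cache for one sequence."""
    # Keys and values are both cached at every layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape and context length.
kv_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=1000)
print(f"{kv_bytes / 2**20:.0f} MiB")  # 125 MiB
```

Size is only half the story. The blob's layout also depends on the layer count, head count and dtype of the engine that produced it, which is why it does not travel to a different model or process, whereas a fixed-width latent does.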

7. Plot twist — this isn’t a plot twist (the telecom anchor, said plainly)

This is the section where, in Parts 1–3, I confessed that the GPU work was secretly a telecom problem in disguise. In Part 4 it is not a confession anymore — the codebase is the disguise being lifted off the paper.

For readers without a 3GPP background, here is the one-paragraph decoder ring. In a 5G or 6G mobile network, a phone (the UE — user equipment) is being served by one base station (the gNB) at a time. When the phone moves, it eventually gets handed over to a new gNB. To do that handover well, the network needs to predict which gNB the phone should be served by next. Modern learned approaches do that prediction with a graph neural network (GNN) over the local topology and a recurrent module (typically a GRU) over the phone’s recent radio measurements. The phone’s evolving hidden state — its “is it walking? is it on a tram? is it about to lose line-of-sight?” context — lives in that GRU. At handover, that GRU state is thrown away, and the target gNB has to re-initialise the per-UE recurrent state and rebuild it from the few measurements it has just received. The paper calls this the post-handover cold start. ILCP is the protocol that fixes it: compress the source-side GRU state with a β-VAE, transport the latent over the standard 3GPP Xn interface (the inter-base-station message bus), and project it back into the target gNB’s state space at handover via a learned gated MLP. The whole thing fits in a 128-byte differential update piggy-backed on the existing HANDOVER REQUEST message.

Read that paragraph and the architecture from sections 3 and 4 of this post side by side. Tell me with a straight face these are different problems.

6G NR handover (at the gNB)	ILCP-for-agents (at the LLM)
UE measurement and mobility history at the source gNB	Agent A’s working context (pooled hidden summary)
128-dim GRU recurrent state h_u at the source gNB	4096-dim pooled hidden vector s_A at the sender
β-VAE compressor encodes h_u to a 32-dim latent	β-VAE compressor encodes s_A to a latent z
128-byte FP32 payload over the 3GPP Xn HANDOVER REQUEST	TransportPayload bytes over in-process transport (network-protocol-agnostic in V1)
Target gNB gated MLP projects the latent into target state space	Gated MLP projects z into K memory tokens in the LM embedding space
Combined via h_new = LayerNorm(decoded_h + γ ⊙ MLP([decoded_h, x_new]))	Combined via torch.cat([memory_tokens, q_embeds]) feeding inputs_embeds
Reduces post-HO cold-start gap (paper: peak +13.3 pp, avg +5.1 pp)	Reduces hand-off cold-start (agent-side measurement is roadmap, not yet shipped)
Eliminates ping-pong handovers in the test split (paper: 0.0% vs 6.5%)	Should reduce “Agent B doesn’t quite get what A meant” follow-up correction loops (agent-side measurement is roadmap)

The left column is published, peer-reviewed, measured on the Vienna 4G/5G drive-test, and accepted at the AI4NextG workshop at ICML 2026. The right column is the V1 wiring in ilcp-for-agents plus an honest “yet” next to every measurement claim. That mapping is the entire reason this post exists.

8. Honest caveats (because the comments are coming)

If you came here to find what is wrong with this project, congratulations on making it this far. To help you, here are the caveats from the LIMITATIONS section of the README and the inline code comments:

No agent-side numerical receipts in V1. This is the single biggest caveat and it deserves to be at the top. The agent-side ilcp-for-agents V1 ships wiring and a toy data/toy_handoff.json with five examples evaluated under a strict exact-match metric. There is no agent-side three-trial benchmark campaign, no p99 latency table, no bytes-per-payload sweep, no quality study against a real held-out QA dataset. Every numerical claim in this post comes from the 6G paper, labelled accordingly, and the agent-side numbers are explicitly on the roadmap. I am not laundering RAN receipts as LLM receipts. If you came here expecting “the V1 agent benchmark,” it does not exist yet, and pretending otherwise would betray the entire honest-receipts thesis of this series.
Lossy state. V1 moves a pooled hidden summary, not full activations or a KV tensor. The β-VAE bottleneck is on purpose, but it is a bottleneck. There is a real risk of dropped detail that matters for the receiver’s task — and the right way to discover what is being dropped is to run the agent-side benchmarks the V1 has not yet shipped. The README says this plainly.
Toy metric. Exact match against five hand-written examples is a wiring check, not an open-domain QA claim. It catches “did the model produce a string at all” and “did the harness wire the prefix correctly.” It does not catch “is the answer good.” Replacing the toy JSON with a real held-out task is roadmap item #1.
In-process transport in V1. The transport boundary is logically explicit (the TransportPayload is a deliberate pack/unpack step) but the V1 wire is torch.Tensor.detach().to('cpu'), not a real network call. A real gRPC or shared-memory ring buffer is roadmap, behind the same stable interface so call sites do not change.
Pascal + bitsandbytes. The README is explicit that NF4 4-bit loading via bitsandbytes may be unavailable or unstable on some sm_61 Pascal stacks. The fall-back is torch_dtype=torch.float16 on GPU or torch.float32 on CPU, or set ILCP_MODEL_ID to a smaller instruct model. Whichever path actually ran in your receipts run should be disclosed; the README puts that disclosure in the receipts narrative file, not in the published metric.
Frozen receiver. V1 treats the receiver LM as frozen and adapts only the projector. That is the cheapest possible bet, and it is the right starting point — but a more capable mapping might allow light receiver-side adaptation. The radio paper does the analog of receiver-side adaptation via the gated combination h_new = LayerNorm(decoded_h + γ ⊙ MLP([decoded_h, x_new])), where x_new is the target gNB’s own freshly observed context. That equation is the exact shape an LLM-side receiver adaptation should probably take, and it is on the roadmap.
Sender pooling is a single-vector summary. The masked-mean-pool over the final-layer hidden states is the approved V1 pooling. It is not the only choice — last-token pooling, attention-pooled summaries, or a small learned pooler all exist — and the V1 explicitly does not claim that masked-mean is optimal. It claims it is reproducible and easy to audit. Section 3 of the radio paper makes the analogous “single fixed-size summary per hand-off” choice for the same reason: it is the simplest contract that can be measured.
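For reference, masked-mean pooling itself is only a few lines. The sketch below uses plain Python lists so it runs without a GPU stack. It illustrates the technique and is not the repository's implementation, which presumably operates on tensors.

```python
def masked_mean_pool(hidden: list[list[float]], mask: list[int]) -> list[float]:
    """Average the hidden vectors of real tokens, ignoring padding positions."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three positions with two features each; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [2.0, 3.0]
```

The padding row is excluded from both the sum and the count, so batch padding cannot shift the summary.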

Everything on this list is on the roadmap. None of it changes the architectural claim. The point of putting it in writing is that you should not have to dig for it — and the moment a benchmark blog post hides its caveats is the moment its numbers stop being trustworthy. (Part 1 said exactly this in its own honest-caveats section. Part 4 inherits the policy verbatim.)
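The gated combination quoted in the frozen-receiver caveat, h_new = LayerNorm(decoded_h + γ ⊙ MLP([decoded_h, x_new])), can be sketched in a few lines of plain Python. The two-layer MLP with ReLU, the LayerNorm without learned scale and shift, the treatment of γ as a per-dimension gate vector, and all weights below are assumptions for illustration; the paper's exact parameterisation is not reproduced here.

```python
import math


def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Normalise to zero mean and unit variance (no learned affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x: list[float], weight: list[list[float]], bias: list[float]) -> list[float]:
    """Dense layer: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gated_update(decoded_h, x_new, gamma, w1, b1, w2, b2):
    """h_new = LayerNorm(decoded_h + gamma * MLP([decoded_h, x_new]))."""
    concat = decoded_h + x_new  # list concatenation, i.e. [decoded_h, x_new]
    hidden = [max(0.0, v) for v in linear(concat, w1, b1)]  # ReLU is an assumed choice
    delta = linear(hidden, w2, b2)
    gated = [h + g * d for h, g, d in zip(decoded_h, gamma, delta)]
    return layer_norm(gated)


# Hypothetical 2-dim state, 2-dim fresh context, 3 hidden units.
decoded_h, x_new = [1.0, 3.0], [0.5, -0.5]
w1 = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1], [0.5, 0.5, -0.5, 0.5]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.3, -0.2, 0.1], [0.2, 0.4, -0.3]]
b2 = [0.0, 0.0]
h_new = gated_update(decoded_h, x_new, [0.5, 0.5], w1, b1, w2, b2)
print(h_new)
```

With γ set to zero the update collapses to LayerNorm(decoded_h), so the transferred state passes through untouched. A larger γ lets the target's freshly observed context correct it.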

9. The V1 ceiling, and the series capstone

This is the final part. There is no Part 5 to defer the hard problems to. So instead of pointing forward, this section points backwards across the whole series and asks what we actually shipped.

The thesis was simple and the four-part shape held:

Part 1 — Redundant prefill. SwarmKV: run prefill once, fan the KV cache out to many branches that share the same document. The fix was “compute once, copy the bytes.” It saved 48.69% end-to-end and 98.09% of the second branch’s activation latency on the canonical GTX 1080.
Part 2 — Redundant waiting. Kube-TimeSlice-Profiler: when many agents share one GPU via Kubernetes CUDA time-slicing, the median lies and the p99 tells the truth. The fix was not “never share” — sharing is how a swarm of agents affords its silicon — it was “measure the tail and stop trusting pod phase.” The same canonical GPU, the same Qwen-class model, a small honest tool that turns “the GPU feels slow” into a degradation factor with three decimals.
Part 3 — Redundant CPU round-trips. CUDA-TopK-Retrieval: the agentic RAG retrieval hop wants to stay on the GPU. The fix was a 343-line CUDA kernel that keeps similarity + Top-K on the device, up to 8.57× faster end to end on the same canonical GPU at the K values that actually matter, with the K=100 ceiling documented honestly.
Part 4 — Redundant context rebuilds. ILCP-for-agents: agent hand-offs currently drop and rebuild context. The fix is a compress → transport → project protocol lifted, almost line for line, from a peer-reviewed 6G handover paper. V1 ships the wiring; the agent-side measurements are explicit future work. The transferable thing is the architecture, not the radio numbers.

Four parts, four kinds of redundant work, one underlying lesson: refusing to recompute beats every clever algorithm. Bin packing, paged attention, speculative decoding, MoE routing, mixture-of-depths, all genuinely impressive. None of them save you anything compared to just not doing the same work twice. Most modern systems work hard. The well-engineered ones work less.

And one more lesson, the one I did not appreciate until I sat down to write Part 4: good infrastructure ideas migrate across industries before they migrate across teams. Cell broadcast, HARQ soft-combining, OFDMA time slicing, MIMO beam codebooks, post-handover state transfer: every one of the four bottlenecks this series tackled was one the radio access network ran into first, sometimes twenty years before the LLM crowd had a name for it. Part 1 was SIB broadcast in transformer costume. Part 2 was TDMA in a Kubernetes ConfigMap. Part 3 was UE-side beam selection through a CUDA kernel. Part 4 was Xn-side state persistence through PyTorch. Different decade, different protocol stack, same problem shape.

That was the series. I am not pretending it is finished engineering — every part has a V2 worth writing, every repo has a roadmap longer than its README — but the four-part thesis is on the page, and the same canonical GTX 1080 is in the receipts at the bottom of every post. Production-minded systems engineering, validated on a budget GPU. That was always the point.

10. Wrap

If you build agentic LLM infrastructure for a living: please go look at how your multi-hop pipeline hands off between specialised agents. If the hop is a string concatenation followed by a tokenise-and-re-prefill on the receiver, you are paying the cold-start tax. The fix does not require new transformer tricks; it requires accepting that the right unit of transfer between agents is a learned summary, not a prompt string.

If you build telecom systems for a living: the 6G handover paper underneath Part 4 is the only peer-reviewed receipt in this whole project, and it sits underneath all four parts as an honest reminder that the radio world has been solving these architecture problems for twenty years while the LLM world was still arguing about prompt templates. Come on over. The compute is great, the deadlines are softer than yours, and the cold-start problem feels familiar.

If you are a beginner who has been reading this series and got all the way to Part 4: congratulations, you now understand more about why agentic AI inference is hard than 80% of the people building it for a living. You also understand why “the bottleneck this month is X” is rarely a new problem — it is almost always an old problem rotated into a new vocabulary. Go find the old problem, because probably somebody already solved it so you don’t have to.

Disclaimer: The illustrations in this article were generated using AI (Claude Opus 4.8). They are illustrative, not photographic, and any labels visible inside the images are stylized rather than authoritative — refer to the article body and the code itself for precise function names, metric values, and architecture details.
