Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture That Rethinks How LLMs Are Served at Scale

For years, the way large language models serve inference has been stuck inside the box – literally. The high-bandwidth RDMA networks that modern LLM deployments depend on keep both prefill and decode inside the same datacenter, sometimes even the same rack. A team of researchers at Moonshot AI and Tsinghua University makes the case that this constraint is about to break down – and that the right architecture can already exploit the shift.

The research team presents Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to dedicated, compute-dense prefill clusters and ships the resulting KVCache over commodity Ethernet to local PD clusters for decoding. The result, in a case study on an internal 1T-parameter hybrid-attention model, is 54% higher throughput than an equivalent homogeneous PD baseline and 32% higher than a naive heterogeneous setup – while consuming only a fraction of the available cross-datacenter bandwidth. The research team notes that on an equivalent-hardware-cost basis the throughput gain is about 15%, indicating that part of the overall 54% gain comes from pairing high-end H200 GPUs for prefill with H20 GPUs for decode.

Why the Existing Architecture Has Hit a Wall

To understand what PrfaaS solves, it helps to understand why LLM serving is split into two stages in the first place. Prefill is the step where the model processes all input tokens and builds the KVCache – it is compute-intensive. Decode is where the model generates one output token at a time – it is memory-bandwidth-intensive. Prefill-decode (PD) disaggregation places these two stages on separate hardware, which improves utilization and lets each stage scale independently.
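The compute/bandwidth asymmetry between the two stages can be made concrete with a back-of-envelope sketch. The model size and the ~2 FLOPs-per-parameter-per-token rule of thumb below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope arithmetic intensity for prefill vs. decode.
# The 70B model size and 2 FLOPs/param/token rule are illustrative
# assumptions, not numbers from the paper.

def flops_per_token(n_params: int) -> float:
    """Rough transformer forward cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params

def prefill_intensity(n_params: int, n_tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic when n_tokens are processed in one pass.
    Weights are read once and reused across every token in the batch."""
    flops = flops_per_token(n_params) * n_tokens
    weight_bytes = n_params * bytes_per_param
    return flops / weight_bytes

def decode_intensity(n_params: int, bytes_per_param: int = 2) -> float:
    """Decode emits one token per step, so weights are re-read for each token."""
    return prefill_intensity(n_params, n_tokens=1, bytes_per_param=bytes_per_param)

if __name__ == "__main__":
    P = 70_000_000_000  # hypothetical 70B-parameter dense model
    print(f"prefill (32K tokens): {prefill_intensity(P, 32_768):.0f} FLOPs/byte")
    print(f"decode  (1 token)  : {decode_intensity(P):.0f} FLOPs/byte")
```

The four-orders-of-magnitude gap in FLOPs per byte is why prefill saturates compute while decode saturates memory bandwidth, and why running them on the same hardware wastes one resource or the other.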

The problem is that separating prefill from decode creates a transport problem. If prefill runs on one set of machines and decode on another, the KVCache produced by prefill must reach the decode side before output generation can start. For conventional full-attention models – those using Grouped Query Attention (GQA) – this KVCache is large. The research team benchmarks MiniMax-M2.5, a representative full-attention model with GQA, which produces KVCache at around 60 Gbps for a 32K-token request on a single 8×H200 instance. Moving that volume of data without stalling compute requires RDMA-class interconnects, which is why standard PD disaggregation is tightly bound to the network fabric of a single datacenter. Splitting prefill and decode across separate clusters, let alone separate datacenters, was not feasible.

Hybrid Attention Changes the Math

What makes PrfaaS timely is an architectural change happening at the model level. A growing class of models – including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T – adopts hybrid attention stacks that interleave a small number of full-attention layers with a large number of linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA). In these architectures, only the full-attention layers produce a KVCache that scales with sequence length. The linear-complexity layers keep a constant-size recurrent state, so their footprint is negligible at long context.

The KV throughput numbers – defined as KVCache size divided by prefill latency – tell the story clearly. For 32K tokens, MiMo-V2-Flash emits KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps compared to 33.35 Gbps for Qwen3-235B, a 4× reduction. For Ring-2.5-1T in particular, the paper breaks down the savings: MLA provides about 4.5× compression over GQA, and the 7:1 hybrid layer ratio provides another roughly 8× reduction, yielding a combined KV memory saving of about 36×. For the internal 1T model used in the case study, the KV throughput at 32K tokens is just 3.19 Gbps – a rate that modern inter-datacenter Ethernet links can readily sustain.
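As a sanity check, the reductions quoted above follow directly from the per-model Gbps figures in the paper; only the printed rounding is ours:

```python
# Sanity-check the KV-throughput ratios quoted in the text
# (KV throughput = KVCache size / prefill latency, in Gbps).

ratios = {
    # model pair: (hybrid-attention Gbps, full-attention baseline Gbps)
    "MiMo-V2-Flash vs MiniMax-M2.5": (4.66, 59.93),
    "Qwen3.5-397B vs Qwen3-235B": (8.25, 33.35),
}

for pair, (hybrid, baseline) in ratios.items():
    print(f"{pair}: {baseline / hybrid:.1f}x reduction")

# Ring-2.5-1T: MLA compression (~4.5x over GQA) stacked with a 7:1
# hybrid layer ratio (~8x fewer KV-bearing layers) multiplies out:
mla_compression, hybrid_reduction = 4.5, 8.0
print(f"Ring-2.5-1T combined KV saving: ~{mla_compression * hybrid_reduction:.0f}x")
```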

But the research team is careful to draw a distinction that matters for anyone building real systems: a small KVCache is necessary but not sufficient to make cross-datacenter PD disaggregation work. Real-world workloads are bursty, request lengths are skewed, prefix caches are unevenly distributed across sites, and inter-cluster bandwidth fluctuates. A naive design that moves all prefill to the remote cluster still runs into congestion and unstable queueing.

What PrfaaS Actually Does

The PrfaaS-PD architecture rests on three subsystems: compute, network, and storage. The compute subsystem divides clusters into two types – local PD clusters that handle short requests end to end, and PrfaaS clusters with high compute-throughput accelerators dedicated to long-context prefill. The network subsystem uses intra-cluster RDMA for fast local transfers and inter-cluster Ethernet for the bulk KVCache transport. The storage subsystem builds a distributed hybrid prefix-cache pool that manages linear-attention states (request-level, fixed size, exact match only) and full-attention KVCache blocks (block-level, growing linearly with input length, supporting partial prefix matching) in separate tiers backed by a shared block pool.
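One way to picture the two cache tiers is a minimal sketch like the following. The class, the block size, and the method names are assumptions for illustration, not the paper's implementation:

```python
# Illustrative sketch of the two cache tiers: linear-attention states are
# cached whole per request (exact match only), while full-attention KVCache
# is cached as fixed-size blocks supporting partial prefix matching.
# BLOCK and all names are assumptions, not from the paper.

BLOCK = 4  # tokens per full-attention KV block (illustrative)

class HybridCache:
    def __init__(self):
        self.linear_states: dict = {}  # request-level, fixed size, exact match
        self.kv_blocks: dict = {}      # block-level, grows with input length

    def put(self, tokens: tuple, linear_state, kv):
        self.linear_states[tokens] = linear_state
        # Index every full-block-aligned prefix so later requests can reuse it.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self.kv_blocks[tokens[:end]] = kv

    def longest_prefix(self, tokens: tuple) -> int:
        """Full-attention tier: longest cached block-aligned prefix, in tokens."""
        for end in range(len(tokens) - len(tokens) % BLOCK, 0, -BLOCK):
            if tokens[:end] in self.kv_blocks:
                return end
        return 0

cache = HybridCache()
cache.put(tuple(range(10)), linear_state="state", kv="kv")
assert cache.longest_prefix(tuple(range(10))) == 8   # partial prefix reuse
assert tuple(range(10)) in cache.linear_states       # exact match only
assert cache.longest_prefix((99,) + tuple(range(9))) == 0
```

The asymmetry in the two tiers' lookup semantics (exact match vs. longest block-aligned prefix) is exactly why the paper keeps them in separate groups over a shared block pool.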

The core routing mechanism is length-based threshold routing. Let l be the incremental prefill length of a request after removing any cached prefix, and let t be the routing threshold. If l > t, the request goes to the PrfaaS cluster and its KVCache is shipped over Ethernet to the decode node. If l ≤ t, it stays on the local PD path. In the case study, the tuned threshold is t = 19.4K tokens, which sends about 50% of all requests – the long ones – to the PrfaaS cluster.
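The routing rule itself is simple enough to sketch directly. The function and cluster names are illustrative; only the rule (route by incremental prefill length l against threshold t) and the 19.4K tuned value come from the text:

```python
# Minimal sketch of PrfaaS length-based threshold routing.
# Names are illustrative; the rule l > t and the 19.4K threshold are
# from the case study described in the text.

THRESHOLD_T = 19_400  # tokens, the tuned threshold from the case study

def route(request_tokens: int, cached_prefix_tokens: int, t: int = THRESHOLD_T) -> str:
    """Return which cluster handles prefill for this request."""
    l = request_tokens - cached_prefix_tokens  # incremental prefill length
    if l > t:
        # Long request: prefill remotely, ship KVCache back over Ethernet.
        return "prfaas"
    # Short request: stay on the local PD path end to end.
    return "local_pd"

assert route(32_768, 0) == "prfaas"          # long, no cache hit
assert route(32_768, 20_000) == "local_pd"   # long, but mostly cached
assert route(8_192, 0) == "local_pd"         # short
```

Note that the cached prefix is subtracted before comparing against t, so a long request with a warm prefix cache can still take the cheap local path.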

Making the Ethernet path operationally reliable takes more than a low KV throughput. The research team describes three practical transport mechanisms: layer-wise prefill pipelining to overlap KVCache generation with transfer, multi-connection TCP transport to fully utilize the available bandwidth, and scheduler-integrated congestion monitoring to detect early signs of loss and retransmission and back off before the link congests.
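A minimal sketch of the multi-connection idea, assuming a simple round-robin striping scheme: a real system would drive parallel TCP sockets, but only the striping and reassembly logic is modeled here.

```python
# Illustrative sketch of multi-connection transport: a KVCache payload is
# striped across several logical connections so no single TCP flow caps
# throughput. Real systems would open parallel sockets; here we only model
# sender-side striping and receiver-side reassembly.

def stripe(payload: bytes, n_conns: int, chunk: int = 4) -> list:
    """Split payload into chunks and assign them round-robin, keeping offsets."""
    lanes = [[] for _ in range(n_conns)]
    for i, off in enumerate(range(0, len(payload), chunk)):
        lanes[i % n_conns].append((off, payload[off:off + chunk]))
    return lanes

def reassemble(lanes: list) -> bytes:
    """Receiver side: merge chunks from all connections back in offset order."""
    pieces = sorted(p for lane in lanes for p in lane)
    return b"".join(data for _, data in pieces)

kv = bytes(range(32))
assert reassemble(stripe(kv, n_conns=3)) == kv
```

Carrying the byte offset with each chunk is what lets the receiver tolerate out-of-order arrival across connections, which is inevitable when flows take different paths.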

On top of this, the research team introduces a dual-timescale scheduler. On short timescales, it monitors PrfaaS utilization and queue depth, adjusting routing as the link approaches its bandwidth ceiling. It also handles cache-affinity routing: when bandwidth is scarce, each cluster consults only its own prefix cache; when bandwidth is plentiful, the scheduler considers the best prefix hit across all clusters and performs cross-cluster cache transfers when that avoids redundant computation. On longer timescales, the scheduler rebalances the prefill threshold and node allocations within the local PD cluster as traffic patterns shift, keeping the system near its optimal operating point.
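The bandwidth-dependent cache-affinity rule can be sketched as follows; the utilization cutoff and all names are assumptions, not values from the paper:

```python
# Hedged sketch of the cache-affinity rule described above: under scarce
# cross-cluster bandwidth, stay with the local prefix cache; under plentiful
# bandwidth, take the globally best prefix hit even if it means a
# cross-cluster cache transfer. The 0.8 cutoff and names are assumptions.

BANDWIDTH_SCARCE = 0.8  # hypothetical link-utilization cutoff

def pick_cluster(local: str, prefix_hits: dict, link_util: float) -> str:
    """Choose which cluster's prefix cache serves a request.

    prefix_hits maps cluster name -> cached prefix length in tokens.
    """
    if link_util >= BANDWIDTH_SCARCE:
        # Scarce bandwidth: stay local, avoid cross-cluster cache transfers.
        return local
    # Plentiful bandwidth: take the best prefix hit anywhere, shipping the
    # cached blocks across clusters if that avoids recomputation.
    return max(prefix_hits, key=prefix_hits.get)

hits = {"pd-a": 2_000, "pd-b": 18_000}
assert pick_cluster("pd-a", hits, link_util=0.9) == "pd-a"  # scarce: local only
assert pick_cluster("pd-a", hits, link_util=0.3) == "pd-b"  # plentiful: best hit
```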

The Numbers

In the case study, a PrfaaS cluster of 32 H200 GPUs is paired with a local PD cluster of 64 H20 GPUs, connected by a VPC network providing roughly 100 Gbps of cross-cluster bandwidth. The aggregate PrfaaS egress under full load is about 13 Gbps – just 13% of the available Ethernet capacity – and the paper notes that the PrfaaS cluster remains compute-bound with substantial bandwidth headroom remaining. The paper also extrapolates to larger deployments: even at the scale of a 10,000-GPU datacenter, the total egress bandwidth required for KVCache transfers comes to only about 1.8 Tbps, well within the capacity of modern inter-datacenter links.

Mean time to first token (TTFT) drops by 50% and P90 TTFT by 64% versus the homogeneous baseline. The naive heterogeneous configuration – all prefill on H200, all decode on H20, with no length-aware routing or scheduling – achieves only 1.16× throughput over the homogeneous baseline, compared to 1.54× for the full PrfaaS-PD system. The gap between 1.16× and 1.54× isolates the contribution of the scheduling layer and shows that it accounts for most of the practical gain.

The research team positions PrfaaS not as a future concept but as a viable design today for hybrid-architecture models – and argues that as context windows grow, KVCache compression techniques mature, and stage-specialized hardware such as NVIDIA's Rubin CPX prefill chips and LPU-style decode chips becomes more widely available, the case for cross-datacenter PD disaggregation will only strengthen.

