The Counterintuitive Networking Decisions Behind OpenAI’s 131,000-GPU Training Fabric

Disable every dynamic routing protocol. Accept packet loss on purpose. Spray each transfer across hundreds of random paths. If someone handed you this list of design decisions for a network connecting 131,000 GPUs, you would assume it was written by someone who had never operated a production network.

A consortium of OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA built exactly this — and quietly inverted three decades of consensus about how high-performance data center networks should work.

The protocol is called MRC, short for Multipath Reliable Connection. It was released on May 5, 2026 through the Open Compute Project. The accompanying research paper (Araujo et al., 2026) details its deployment across OpenAI’s largest NVIDIA GB200 supercomputers, including the Stargate site with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. MRC has been used to train the latest frontier models behind ChatGPT and Codex.

What is most striking on close reading of the paper is something the press coverage has not surfaced: MRC effectively eliminates the entire Layer 3 control plane from the data center fabric. No OSPF. No BGP. No IS-IS. No FIB. The switches in the deployment maintain zero dynamic forwarding state. To the author’s knowledge, this is the most aggressive elimination of dynamic routing in any production AI training fabric publicly documented to date.

The paper’s core argument is that at 100,000+ GPU scale, tail latency from network congestion and failures dominates training performance, and the conventional networking stack cannot solve this without fundamental changes to how packets move between GPUs. MRC is those fundamental changes, implemented in 800 Gb/s NICs from three different silicon vendors and deployed in production.

What makes MRC worth studying carefully is not that it is fast. It is that the design decisions behind it contradict several principles that the networking community has treated as settled for decades. Understanding why those decisions work at this scale, and where they might not, matters for anyone building or operating AI infrastructure.

Figure 1. The failure cascade that MRC eliminates.
Left: conventional RoCE with single-path routing. A congested T1 link triggers PFC PAUSE that propagates backward, blocking GPU 2 even though its own path was clear. All 100,000 GPUs idle until GPU 2’s transfer completes. Right: MRC sprays packets across 8 independent planes. When a link fails in Plane 2, the NIC retires that entropy value and redistributes traffic to the remaining 7 planes in microseconds. No GPU ever stalls. The five numbered design decisions at the bottom are the subject of this article.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

Each of MRC’s decisions is individually familiar to anyone who has followed networking research. The combination is what is radical. The networking community has explored every one of these ideas in isolation — multi-plane fabrics, source routing, packet spraying, lossy transports with selective retransmission, ECN as a load-balancing signal. What makes MRC worth careful study is that the OpenAI consortium committed to all of them, simultaneously, in production at 131,000 GPUs.

The problem: one straggler blocks 100,000 GPUs

Synchronous pretraining runs in lock-step. Every training step involves millions of data transfers across thousands of GPUs performing a combination of tensor parallelism, pipeline parallelism, data parallelism, and expert parallelism. The step cannot advance until the slowest transfer completes. At 100,000 GPUs, the duration of each communication round is determined by the tail of the transfer latency distribution, not the mean.

The paper frames this precisely: “As computations scale, communication becomes increasingly outlier-dominated.” A single congested link, a single flow collision, a single switch buffer overflow can stall thousands of GPUs for milliseconds. At cloud rates, 100,000 H100-class GPUs cost roughly $300,000 per hour, so a 10-millisecond stall that recurs every training step, across thousands of steps, is not a rounding error. It is a line item.
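
To make that concrete, here is the back-of-the-envelope arithmetic as a short Python sketch, using the article's own figures (the step count is illustrative):

cluster_cost_per_hour = 300_000                  # ~100,000 H100-class GPUs at cloud rates (USD)
cost_per_second = cluster_cost_per_hour / 3600   # ~$83 of cluster time per second

stall_per_step = 0.010    # one 10 ms whole-cluster stall per training step
steps = 100_000           # illustrative length of a pretraining run

idle = stall_per_step * steps                    # 1,000 s of cluster-wide idle time
print(f"${idle * cost_per_second:,.0f}")         # ~ $83,333 lost to one recurring stall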

Network failures compound the problem. At this scale, link flaps, optic failures, and switch reboots are not rare events. They are statistical certainties that occur multiple times per day across a fabric with hundreds of thousands of links. The paper reports a production incident where an optical transceiver on a T0 switch “suffered a glitch, and flapped all its four links in rapid succession,” affecting three active training nodes simultaneously. In a conventional network, this would have crashed the training job.

MRC’s design goal was not just higher bandwidth. It was predictable bandwidth, even in the presence of failures, with a control plane simple enough that a small team can manage multiple supercomputers simultaneously.

The topology: 131,000 GPUs in two switch tiers

The first design decision is architectural, not protocol-level. Instead of treating an 800 Gb/s NIC as one fat pipe, MRC splits it into eight 100 Gb/s links, each connecting to a different switch. This creates eight parallel network planes, each operating independently.

Consider a conventional approach. Today’s fastest datacenter Ethernet switches offer 51.2 Tb/s of switching capacity, yielding 64 ports at 800 Gb/s. In a standard fat-tree Clos topology, each Tier-0 (T0) switch connects down to 32 NICs and up to 32 Tier-1 (T1) switches, and a Tier-2 spine layer joins 64 such pods. That gives you a 3-tier network supporting roughly 64,000 GPUs at full bisection bandwidth. To reach 100,000, you need a fourth tier, which adds latency, cost, and failure domains.

Now split the NIC. The same 51.2 Tb/s switch at 100 Gb/s per port gives you 512 ports instead of 64. Each T0 switch connects down to 256 NIC ports and up to 256 T1 switches. Each T1 connects to 512 T0s. A single two-tier plane supports 131,072 GPUs at full bisection bandwidth.
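
Both capacities follow from the standard fat-tree formulas. A quick sketch, where radix is ports per switch and both configurations use the same 51.2 Tb/s silicon:

def max_gpus(radix, tiers):
    # Full-bisection fat-tree capacity: radix^2 / 2 hosts at 2 tiers, radix^3 / 4 at 3.
    return radix**2 // 2 if tiers == 2 else radix**3 // 4

print(max_gpus(64, 3))    # 65,536  -- conventional: 64 ports at 800 Gb/s, three tiers
print(max_gpus(512, 2))   # 131,072 -- one MRC plane: 512 ports at 100 Gb/s, two tiers

Note that each of the eight planes independently supports 131,072 endpoints — one 100 Gb/s port per GPU per plane — so splitting the NIC multiplies per-GPU bandwidth, not cluster scale.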

The paper quantifies the savings:

Conventional 3-tier (800 Gb/s):
  - 3 switch tiers, 64-port switches
  - Max ~64K GPUs at full bisection BW
  - 5-hop or 7-hop worst-case path

Multi-plane 2-tier (8 × 100 Gb/s):
  - 2 switch tiers, 512-port switches
  - 131K GPUs at full bisection BW
  - 3-hop worst-case path
  - 2/3 the optics of a 3-tier network
  - 3/5 the number of switches

Figure 2. Conventional 3-tier fat-tree vs MRC 2-tier multi-plane topology. Both use the same 51.2 Tb/s switch silicon. The conventional approach configures 64 ports at 800 Gb/s, requires three tiers, and maxes out at roughly 64,000 GPUs. MRC splits each NIC into 8 × 100 Gb/s links, creating 8 independent two-tier Clos fabrics that support 131,072 GPUs with fewer switches and fewer optics. The red dashed line on the left traces the worst-case 7-hop path. On the right, every path crosses at most 3 hops.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

The resilience math is equally compelling. In the 800 Gb/s network, a T0 switch has 32 uplinks, so losing a single T0-to-T1 link costs 3% of that switch’s uplink bandwidth; in the 100 Gb/s multi-plane network, with 256 uplinks per T0, the same failure costs 0.4%. More importantly, with eight independent planes, a NIC that loses its link into one plane can continue operating on the remaining seven while the failed link is repaired. The training job does not need to stop.

This tradeoff is not free. Eight separate planes mean eight times as many links to monitor, eight times as many potential failure points in aggregate, and a transport protocol that must load-balance intelligently across all of them. That is where MRC itself comes in.

Packet spraying with entropy values

Conventional RDMA transports (RoCEv2, InfiniBand RC) pin each connection to a single network path. The path is selected by hashing the flow’s five-tuple (source/destination IP, source/destination port, protocol) at each switch. Once pinned, every packet in that connection follows the same path until the connection is torn down.

This works at moderate scale. It fails at 100,000+ GPUs because of flow collisions. When two connections hash to the same path through the same bottleneck link, both suffer. The probability of collision increases with scale, and the tail latency impact is disproportionate.

MRC eliminates flow pinning entirely. Instead, it assigns each Queue Pair (QP) a set of 128 to 256 entropy values (EVs) at connection setup. Each EV encodes a specific path through a specific network plane. The sender rotates through its EV set packet by packet, spraying consecutive packets across hundreds of distinct paths spanning all eight planes. No two consecutive packets from the same transfer take the same route.

The EV is a 32-bit value split across the UDP source port and the IPv6 flow label in each MRC packet. Switches hash on these fields, so changing the EV changes the path. The sender does not need to know the topology. It only needs to know that different EVs produce different paths.

The paper’s per-QP state and send loop, rendered as a runnable Python sketch (field names are illustrative):

import random
from itertools import cycle

# Per-QP state: 128-256 random 32-bit entropy values (EVs), each tracking
# the health of the path it hashes onto.
ev_set = [random.getrandbits(32) for _ in range(256)]
health = dict.fromkeys(ev_set, "active")  # active | congested | suspected_failed | confirmed_failed
rotation = cycle(ev_set)

def next_active_ev():
    # Round-robin through the EV set, skipping congested or failed paths.
    return next(ev for ev in rotation if health[ev] == "active")

def spray(packets, send):
    # Stamp each packet with a fresh EV so no two consecutive packets share a path.
    for pkt in packets:
        ev = next_active_ev()
        pkt["udp_src_port"] = ev & 0xFFFF    # EV bits 0-15
        pkt["ipv6_flow_label"] = ev >> 16    # EV bits 16-31
        send(pkt)

Each EV carries a few bits of health state. When the receiver detects congestion on a path (via ECN marking from switches), it echoes this back to the sender, which temporarily avoids that EV. When a packet is actually lost (not trimmed), MRC assumes the path has failed and immediately stops using that EV. Background probes periodically test retired EVs to determine whether the failure was transient, resurrecting them if probes succeed.
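
A minimal sketch of that lifecycle as a per-EV state machine — the cooldown value is illustrative, and the paper’s suspected/confirmed failure distinction is collapsed into a single state here:

import time

class EntropyValue:
    """Health of one path: active -> congested or failed -> (probe) -> active."""
    COOLDOWN = 0.001  # 1 ms avoidance window after an ECN echo (illustrative)

    def __init__(self, ev):
        self.ev, self.state, self.until = ev, "active", 0.0

    def on_ecn_echo(self):       # receiver echoed an ECN mark seen on this path
        self.state, self.until = "congested", time.monotonic() + self.COOLDOWN

    def on_loss(self):           # a packet vanished outright: assume path failure
        self.state = "failed"    # retired until a background probe succeeds

    def on_probe_success(self):  # a periodic probe traversed the path again
        self.state = "active"

    def usable(self):
        if self.state == "congested" and time.monotonic() >= self.until:
            self.state = "active"  # congestion cooldown expired
        return self.state == "active"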

The load-balancing quality of this scheme is high. Because different senders independently generate random EV sets, the aggregate traffic distribution across paths is near-uniform. Small imbalances are smoothed by the ECN feedback loop: if one path accumulates slightly more traffic, ECN marks increase on that path, and senders redistribute to less-loaded alternatives.

Figure 3. Lifecycle of a single entropy value. Each of the 128-256 EVs per Queue Pair independently tracks its path health through a four-state machine: Active (normal), Congested (ECN received, temporarily avoided), Failed (packet lost, retired), and Probing (background probes testing whether the path has recovered). The full cycle runs in 1-2 milliseconds, compared with the 1-30 seconds conventional dynamic routing needs to converge — a factor of 1,000 to 30,000.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]
Figure 4. Entropy-value-based packet spraying. A single RDMA write transfer from GPU 42 to GPU 9127 sprays 8 consecutive packets across all 8 network planes. Each packet carries a different 32-bit entropy value (split across UDP source port and IPv6 flow label), which causes switches to hash it onto a different path. Packets arrive out of order at the receiver, which writes each one directly to its RDMA virtual address in HBM without a reorder buffer. No two consecutive packets share a route, eliminating flow collisions by design.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

Static source routing with SRv6

This is the most counterintuitive decision in the paper. Every production datacenter network runs dynamic routing protocols (BGP, OSPF, IS-IS) that compute forwarding tables, react to topology changes, and converge after failures. MRC disables all of them.

Instead, MRC uses IPv6 Segment Routing (SRv6) to encode the full path each packet should take. The sender embeds the sequence of switch identifiers directly into the packet’s destination address. Each switch along the path checks if its identifier is present, removes it by shifting the address, and forwards to the next hop. No routing table lookup. No forwarding information base. No control plane convergence.
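
The forwarding step is simple enough to model in a few lines. A toy sketch of shift-and-forward source routing — in the real protocol the segment identifiers are packed into the 128-bit IPv6 destination address, for which a Python list stands in here:

def encode_path(switch_ids, dst_nic):
    # Sender NIC: the "destination address" is the entire path, head segment first.
    return list(switch_ids) + [dst_nic]

def switch_forward(packet, my_id, port_toward):
    # Switch: no FIB, no lookup. Verify the head segment, shift, forward.
    assert packet["dst"][0] == my_id, "a mis-forwarded packet is immediately visible"
    packet["dst"] = packet["dst"][1:]        # "shift" the address
    return port_toward[packet["dst"][0]]     # egress port toward the new head

# One 3-hop path through one plane: T0 up, T1 across, T0 down, then the NIC.
pkt = {"dst": encode_path(["t0_17", "t1_203", "t0_482"], "nic_9127")}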

The paper explains the logic: “We took the unusual position of disabling dynamic routing in the switches because we didn’t want two adaptive routing mechanisms interacting with each other and dynamic routing wasn’t adding anything.”

MRC’s transport-layer adaptation (EV management, ECN feedback, path probing) already handles failures at microsecond timescales. Dynamic routing protocols converge in seconds to minutes. Running both creates a risk of conflicting decisions: MRC avoids a failed path at the transport layer while the routing protocol is still converging to a new forwarding state, potentially creating routing loops or oscillations.

By removing dynamic routing entirely, MRC gets three operational benefits:

First, deterministic forwarding. Every packet follows a known, pre-computed path. If something goes wrong, you can trace exactly which switches the packet traversed. The paper notes that this “gives us very good observability” because the path is encoded in the packet itself.

Second, eliminated convergence failures. Dynamic routing protocols can misconfigure, loop, or partition the network during convergence. With static SRv6 routes, these failure modes do not exist. The switches are stateless packet forwarders.

Third, simplified operations. The paper emphasizes that “very small teams of people need to be able to manage the networks of multiple supercomputers.” Removing routing protocols removes an entire category of operational complexity, configuration drift, and debugging surface area.

The tradeoff is that path computation moves to the NIC. The MRC NIC must know enough about the topology to generate valid SRv6 paths for its EV set. In OpenAI’s deployment, this is handled at QP setup time using a simple topology database. The paths are static and pre-computed. Runtime adaptation happens at the EV selection level, not at the routing level.
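
A hedged sketch of what that setup-time computation might look like — the topology database shape and path-selection policy here are assumptions, not details from the paper:

import random

def build_ev_paths(src_t0s, dst_t0s, t1s_in_plane, num_evs=256, planes=8):
    """Map each EV to a pre-computed segment list, spreading EVs across planes."""
    ev_paths = {}
    while len(ev_paths) < num_evs:
        ev = random.getrandbits(32)
        plane = ev % planes                      # which of the 8 planes this EV uses
        t1 = random.choice(t1s_in_plane[plane])  # any T1 in the plane completes a path
        ev_paths[ev] = [src_t0s[plane], t1, dst_t0s[plane]]
    return ev_paths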

Figure 5. SRv6 source routing. The sender NIC encodes the full switch path as a sequence of segment identifiers in the IPv6 destination address. At each hop, the switch reads its own identifier, shifts the address left, and forwards to the next segment. No routing table lookup, no forwarding information base, no control plane convergence. The contrast panel shows that MRC switches maintain zero forwarding state, compared to 100,000+ FIB entries in a conventional switch running OSPF or BGP. Path computation happens once at QP setup time; runtime adaptation happens at the entropy value layer.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

Running lossy: why MRC disables PFC

This is the decision that will surprise most networking practitioners. RDMA networks have traditionally relied on Priority Flow Control (PFC) to create lossless Ethernet fabrics. When a switch buffer fills, PFC sends a pause frame upstream, preventing the sender from transmitting until the buffer drains. InfiniBand has a similar credit-based flow control mechanism. The entire “lossless fabric” paradigm exists to support RDMA’s assumption that packets do not get dropped.

MRC explicitly disables PFC and runs on standard best-effort (lossy) Ethernet.

The reason is head-of-line blocking. When a PFC pause frame fires on one port, it can block traffic destined for other ports that share the same ingress buffer. In a large training cluster running multiple collectives simultaneously, a PFC pause triggered by one collective’s incast can delay transfers from a completely unrelated collective. This cross-collective interference creates exactly the tail latency outliers that MRC is designed to eliminate.

The paper’s solution is a combination of three mechanisms:

First, selective retransmission. MRC tracks which packets have been received using Selective ACKs (SACKs). When loss is detected, only the missing packets are retransmitted, not the entire window. This is faster than go-back-N retransmission used in some RoCE implementations.

Second, packet trimming. When a switch would drop a packet due to buffer overflow, it instead trims the payload and forwards just the header as a priority packet. The receiver gets the trimmed header, recognizes the gap, and sends a NACK to trigger immediate retransmission. This eliminates the timeout delay between loss detection and retransmission. It also lets MRC distinguish between congestion drops (trimmed packets) and link failures (no packet at all), enabling different recovery strategies for each.
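
At the switch, trimming replaces the drop decision. A sketch — real trimming happens in the forwarding pipeline, and the retained header fields are illustrative:

def enqueue_or_trim(pkt, queue, capacity, priority_queue):
    # On buffer overflow, keep the header as evidence instead of dropping silently.
    if len(queue) < capacity:
        queue.append(pkt)
    else:
        header = {k: pkt[k] for k in ("dst", "seq", "vaddr")}  # payload removed
        priority_queue.append(header)  # trimmed header bypasses the congestion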

Third, out-of-order memory placement. Every MRC data packet carries the RDMA virtual address and remote key. The receiving NIC can write each packet directly to its final memory location regardless of arrival order. This is critical because packet spraying across hundreds of paths guarantees that packets will arrive out of order. Without direct placement, the receiver would need reorder buffers, adding latency and memory overhead.
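
On the receive side, the three mechanisms compose naturally. A sketch of direct placement plus SACK/NACK bookkeeping, reusing the illustrative packet fields from the sketches above (hbm, a bytearray, stands in for GPU memory):

received = set()  # sequence numbers seen so far: conceptually, the SACK bitmap

def on_packet(pkt, hbm, send_nack):
    if "payload" in pkt:
        # Full packet: write straight to its RDMA virtual address.
        # Arrival order is irrelevant; no reorder buffer exists.
        hbm[pkt["vaddr"] : pkt["vaddr"] + len(pkt["payload"])] = pkt["payload"]
        received.add(pkt["seq"])
    else:
        # Trimmed header: the payload was cut at a congested switch.
        # NACK immediately rather than waiting for a sender-side timeout.
        send_nack(pkt["seq"])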

Figure 6. PFC lossless vs MRC lossy recovery under incast congestion. Top: with PFC enabled, Collective B’s incast fills a shared switch buffer, triggering PAUSE frames that propagate upstream and block Collective A, which was not congested. This is head-of-line blocking: one collective’s congestion stalls an unrelated collective. Bottom: MRC disables PFC entirely. The switch trims excess packets, dropping the payload and forwarding the header at high priority; the receiver NACKs the gap, and the sender selectively retransmits only the missing packets via SACK. Collective A is never affected. Recovery takes microseconds, not milliseconds.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

ECN repurposed: load balancing, not congestion control

In conventional networks, Explicit Congestion Notification (ECN) signals congestion to the sender, which responds by reducing its transmission rate (similar to TCP congestion control). MRC repurposes ECN entirely.

In MRC’s multi-plane topology with full bisection bandwidth, aggregate congestion should not exist under normal operation. The total available bandwidth exceeds the total demand. What does exist is local path imbalance: some paths may be slightly more loaded than others due to the random EV selection across different senders.

MRC uses ECN as a per-path load signal. Switches mark packets with ECN in the standard randomized manner, but MRC disables ECN marking on the last hop to the receiver (to avoid confusing last-hop incast with fabric congestion). The receiver echoes ECN marks back to the sender, tagged with the specific EV that was marked. The sender then temporarily avoids that EV, shifting traffic to less-loaded paths.

This transforms ECN from a rate-control mechanism into a routing-level load-balancing signal. The sender does not slow down. It redirects. The distinction matters: reducing the rate wastes GPU time because the transfer takes longer, while redirecting maintains full throughput and smooths out the imbalance.
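
In terms of the EntropyValue sketch above, the sender’s entire response to an ECN echo is a single state change — no rate variable is touched:

def handle_ecn_echo(qp_evs, marked_ev):
    # Conventional RoCE would cut the sending rate here. MRC instead retires
    # the one marked path; next_active_ev() skips it until the cooldown expires.
    qp_evs[marked_ev].on_ecn_echo()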

Figure 7. ECN repurposed. Top: in conventional RoCE, an ECN echo causes the sender to reduce its transmission rate — slowing the transfer and forcing the GPU to wait. Bottom: in MRC, an ECN echo is tagged with the specific entropy value that experienced congestion. The sender simply stops using that EV and shifts traffic to the other 255 EVs in its pool. The transmission rate never drops. The same congestion signal becomes a routing hint instead of a back-pressure signal.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

What the production evidence shows

The paper reports results from two contexts: production frontier model training and controlled testbed experiments.

In production, MRC allowed training jobs to ride out network failures that previously would have crashed the job. The paper describes the optical transceiver glitch mentioned earlier: four links flapped in rapid succession on three active training nodes. MRC detected the path failures, stopped using the affected EVs, and redistributed traffic across remaining paths. The training job continued without interruption. In a conventional RoCE deployment, this event would have triggered PFC storms, NCCL timeouts, and a job restart costing hours of GPU time.

The testbed experiments quantify MRC’s performance characteristics:

Point-to-point bandwidth: MRC achieves near-line-rate throughput on 800 Gb/s links with packet spraying. The paper reports comparisons against standard RoCE showing MRC’s advantage in multi-path scenarios.

Link failure recovery: when a link goes down, MRC detects it and redistributes traffic in tens of microseconds. No sender-side timeouts. No routing protocol convergence. The EV that mapped to the failed path is retired immediately, and the remaining EVs absorb the traffic.

Load balancing across EVs: the paper measures traffic distribution across planes and paths, showing near-uniform utilization under production workloads.

NCCL collective performance at scale: the paper evaluates MRC’s performance on all-reduce operations, which are the dominant communication pattern in data-parallel training. MRC’s packet spraying eliminates the flow-collision problem that degrades all-reduce performance at scale with conventional ECMP hashing.

The operational evidence supports the static routing decision. The paper reports that T1 core switches were rebooted during active training runs without disrupting the job. In a conventional network with dynamic routing, rebooting a core switch triggers reconvergence across the fabric. With static SRv6, the switch simply reloads its static forwarding state and resumes. MRC’s transport layer handled the temporary loss of paths through that switch by redistributing traffic to other planes.

Figure 8. Failure recovery comparison. Top: conventional RoCE with dynamic routing. A link failure triggers OSPF/BGP reconvergence across all switches, taking 1-30 seconds. During this time, all 100,000 GPUs are idle, NCCL timeouts become likely, and the training job may need to restart. At $300K/hour for a 100,000-GPU cluster, each second of idle time costs $83. Bottom: MRC with static SRv6. The NIC detects the loss via SACK within microseconds, retires the affected entropy value, and redistributes traffic to the remaining planes. No routing protocol needs to converge. The timescale is zoomed 1,000× to show the microsecond-level response. The training step completes without interruption.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

Where these design decisions are strongest

MRC was designed for a specific workload profile: synchronous pretraining with all-reduce dominated communication, running on a single-tenant fabric with full bisection bandwidth. Within these constraints, the three design decisions are well-matched to the problem:

Static routing works because the topology is fixed and known at deployment time. Training clusters do not add or remove switches during a training run. The failure modes are link-level (handled by MRC’s EV management), not topology-level (which would require routing protocol reconvergence).

Lossy Ethernet works because the selective retransmission and packet trimming mechanisms recover faster than PFC pause frames propagate. The cross-collective head-of-line blocking that PFC creates is more damaging to tail latency than the occasional retransmission overhead.

ECN-as-load-balancing works because the multi-plane topology provides full bisection bandwidth, ensuring that aggregate congestion does not occur. Local imbalances are the only congestion source, and ECN-guided EV avoidance is a precise, low-overhead mechanism for smoothing them.

Figure 9. How MRC works at the GPU level. The GPU node (left) consists of a Blackwell-class GPU plus 192 GB of HBM3e, connected to the MRC NIC via PCIe Gen5 or NVLink-C2C. The MRC NIC contains four key modules: transport (QP + SACK), SRv6 path encoder, EV manager (256 EVs per QP), and the RDMA engine. The NIC’s single 800G port is broken out into 8 × 100G links, one per network plane. A typical MRC packet (right top) carries the SRv6 path, the entropy value, the RDMA header (vaddr + rkey), the sequence number, and the payload. The 4-step data flow (right middle) shows how a collective operation becomes a sprayed write across 8 planes. At the receiver (right bottom), packets arrive out of order and write directly to their target HBM offsets with no reorder buffer. The GPU never sees the network — it issues memory operations, and the NIC handles everything else.
[Source: Author (SVG created using Inkscape) – Reference: arXiv:2605.04333 (2026)]

The boundary conditions: where MRC works and where it doesn’t

MRC is a production-proven protocol for its target workload. The natural questions for the broader AI infrastructure community concern the boundary conditions.

First, multi-tenancy. OpenAI’s training clusters run a single training job at a time across the full fabric. Most cloud providers and enterprise deployments share GPU clusters across multiple workloads. MRC’s static routing assumes a stable topology database at the NIC level. In a multi-tenant environment where workloads are dynamically placed, the topology visible to each NIC changes frequently. Whether MRC’s path-generation logic adapts to this or requires modifications is an open engineering question.

Second, inference workloads. MRC was designed for synchronous training’s all-reduce communication pattern: large bulk transfers between known sets of GPUs. Inference workloads, particularly disaggregated inference with KV cache transfers between prefill and decode pools, have a different communication profile: smaller transfers, point-to-point rather than collective, and latency-sensitive at the individual request level rather than the aggregate step level. Packet spraying across hundreds of paths adds jitter to individual transfer latency, which may or may not matter depending on the SLO requirements.

Third, oversubscribed networks. MRC’s ECN-as-load-balancing mechanism relies on full bisection bandwidth. In oversubscribed networks (common in cloud environments where cost optimization drives topology decisions), aggregate congestion is real, not just local imbalance. ECN would need to function as a genuine congestion signal in this case, which changes MRC’s flow control dynamics.

Fourth, interoperability. MRC is currently implemented in specific NIC silicon (NVIDIA ConnectX-8, AMD Pollara/Vulcano, Broadcom Thor Ultra) and specific switch platforms (NVIDIA Spectrum-4/5, Arista EOS on Broadcom Tomahawk 5). The OCP release of the specification enables broader implementation, but silicon-level protocol support takes 12-18 months to develop and validate. Near-term adoption will be limited to organizations using these specific hardware platforms.

These are not criticisms of MRC. They are the engineering questions that arise naturally when a protocol designed for a specific, well-defined environment meets the diversity of the broader infrastructure market. The fact that MRC solved the tail latency problem at 131,000-GPU scale is a genuine achievement. The question for the rest of the community is which of its design decisions generalize and which are specific to the constraints of single-tenant, full-bisection-bandwidth training fabrics.

What MRC signals about the future of AI networking

MRC represents a broader shift in how AI infrastructure thinks about networking. The conventional approach treats the network as a transparent pipe: packets go in one end and come out the other, and the transport protocol’s job is to fill the pipe as efficiently as possible. MRC treats the network as a managed resource with observable, per-path health signals that the transport protocol actively exploits.

This is not a new idea in networking research. Multipath TCP, Valiant load balancing, and ECMP have all explored variations of it for years. What is new is the scale at which MRC operates, the aggressiveness of its design decisions (no PFC, no dynamic routing, full packet spraying), and the production evidence that it works on the largest AI training clusters in the world.

For networking practitioners, MRC validates a thesis that has been debated for a decade: at sufficient scale, endpoint intelligence beats network intelligence. Making the NIC smarter and the switch simpler produces a more resilient system than making the switch smarter and the NIC simpler. Whether you agree with every design decision or not, the production evidence from OpenAI and Microsoft makes this argument harder to dismiss than it was a week ago.

The MRC specification is available through OCP under an open license. The research paper provides detailed experimental results. For anyone building GPU clusters at scale, both are worth reading carefully. The three rules MRC breaks might be the same three rules holding your network back.
