OpenAI Introduces MRC (Multipath Reliable Connection): A New Open Communication Protocol for the Large Clusters That Train AI Models

Training frontier AI models is not just a compute problem – it is increasingly a networking problem. And OpenAI recently released its solution.
OpenAI announced the release of MRC (Multipath Reliable Connection), a novel communication protocol developed over the past two years in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The specification was published through the Open Compute Project (OCP), which enables the wider industry to use and build upon it.
Why Networking Is the Hidden Bottleneck in AI Training
To understand why MRC matters, you need to understand what happens inside a supercomputer during model training. A single training step can involve many millions of data transfers, and the step only completes when the last of them arrives. One late transfer can stall the entire process, leaving expensive GPUs sitting idle.
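To make the straggler effect concrete, here is a toy Python simulation (my own illustration, not from OpenAI's paper): step time is the maximum over all transfer times, so even a very rare slow transfer becomes near-certain once a step involves millions of transfers.

```python
# Toy model: a training step finishes only when its slowest transfer does.
# Numbers are illustrative, not measurements from the paper.
import random

random.seed(0)

def step_time_ms(num_transfers: int, p_slow: float) -> float:
    """Each transfer normally takes ~1 ms; a rare congested or retried
    transfer takes 100 ms. The step completes at the slowest transfer."""
    return max(
        100.0 if random.random() < p_slow else random.uniform(0.9, 1.1)
        for _ in range(num_transfers)
    )

# A 1-in-100,000 chance of a slow transfer barely matters for a thousand
# transfers, but all but guarantees a 100x slower step at a million.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} transfers -> step time {step_time_ms(n, 1e-5):6.1f} ms")
```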
Network congestion, link failures, and device failures are the most common sources of delay and jitter in these transfers – and the problems become more frequent, and harder to solve, as the cluster grows. This is the infrastructure challenge OpenAI set out to address.
According to OpenAI, more than 900 million people use ChatGPT every week. Serving and developing models at that scale means every second of GPU idle time translates into real cost and wasted power. OpenAI states its goal as “not just building a fast network, but also building one that delivers predictable performance, even in the face of failure, to keep training jobs running.”
What MRC Actually Does: Three Main Approaches
MRC is not an invention from scratch. It extends RDMA over Converged Ethernet (RoCE) – an InfiniBand Trade Association (IBTA) standard that enables hardware-accelerated remote direct memory access between GPUs and CPUs. It adopts techniques developed by the Ultra Ethernet Consortium (UEC) and extends them with SRv6-based source routing to support large-scale AI communication fabrics.
RoCE is a protocol that allows one machine to read or write another machine's memory directly over an Ethernet network, bypassing the remote CPU entirely. SRv6 (Segment Routing over IPv6) takes this a step further: the sending machine writes the exact route a packet should follow directly into the packet header, so switches no longer need to perform complex routing computations. This reduces the processing load on the switches and saves energy – something that adds up at data-center scale.
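As a rough sketch of what “the route lives in the packet” means, here is an SRv6 packet built with scapy (the tooling and addresses are my assumptions for illustration; the paper does not prescribe them):

```python
# A minimal SRv6 source-routing sketch using scapy (pip install scapy).
# Addresses are placeholder documentation-prefix values, not real fabric IPs.
from scapy.layers.inet6 import IPv6, IPv6ExtHdrSegmentRouting
from scapy.packet import Raw

# The sender encodes the full hop-by-hop path in a segment routing header.
# Per the SRv6 spec, segments are stored in reverse order, and segleft
# points at the currently active segment.
pkt = (
    IPv6(src="2001:db8::aa", dst="2001:db8:0:1::1")  # first hop: tier-0 switch
    / IPv6ExtHdrSegmentRouting(
        addresses=[
            "2001:db8:0:3::1",  # final hop: peer NIC
            "2001:db8:0:2::1",  # spine switch
            "2001:db8:0:1::1",  # tier-0 switch (active segment)
        ],
        segleft=2,
    )
    / Raw(b"RDMA payload")
)
pkt.show()  # switches only follow the header; no routing computation needed
```

Because every switch simply forwards toward the next segment, rerouting around a failure only requires the sender to choose a different precomputed segment list – nothing in the fabric has to reconverge.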
1. Adaptive Packet Spraying to Eliminate Congestion
Instead of sending each transmission down a single network path, MRC sprays packets across hundreds of paths simultaneously, reducing congestion in the network core. With traditional RoCEv2, all packets of a flow were pinned to one path from point A to point B, creating congestion hot spots. To overcome this, MRC introduces intelligent packet-spray load balancing: if one path is congested or broken, packets are simply routed over other paths in the network. The result is higher bandwidth utilization, reduced tail latency, and load balancing at per-packet granularity.
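The toy sketch below (my own model, not OpenAI's code) captures the core idea: every packet independently picks the least-loaded healthy path, so load spreads at packet granularity and broken paths are skipped automatically.

```python
# Toy per-packet sprayer: the policy and names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Sprayer:
    num_paths: int
    failed: set[int] = field(default_factory=set)
    in_flight: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.in_flight = [0] * self.num_paths

    def pick_path(self) -> int:
        """Send the next packet on the healthy path with the fewest
        packets currently in flight."""
        healthy = (p for p in range(self.num_paths) if p not in self.failed)
        path = min(healthy, key=lambda p: self.in_flight[p])
        self.in_flight[path] += 1
        return path

    def complete(self, path: int) -> None:
        """Called when a packet on this path is acknowledged."""
        self.in_flight[path] -= 1

sprayer = Sprayer(num_paths=8)
sprayer.failed.add(3)  # pretend path 3's link just went down
print([sprayer.pick_path() for _ in range(14)])  # spreads over 0-7, skips 3
```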
2. Microsecond-Level Failure Detection with SRv6 Static Source Routing
If a network path, link, or switch fails, MRC can detect the problem and route around it on a microsecond time scale. Standard network fabrics can take seconds, or even tens of seconds, to reconverge after a failure. An important architectural decision makes this possible: switches never recalculate routes – they do nothing but follow the static routes they have been configured with. All routing intelligence lives in the NIC, not the switch. This is a deliberately unusual design: it disables dynamic routing on switches entirely so that two layers of dynamic routing can never work against each other.
Before MRC, if the link between a GPU's network interface and its tier-0 switch failed, the training job would fail with it. With MRC, the job survives and keeps running. If an 8-port network interface loses one port, its maximum rate drops by one eighth. MRC detects this, recalculates routes to avoid the failed plane, and immediately tells peers not to use that plane for inbound traffic. Most failed links recover within a minute, at which point MRC returns the plane to service.
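A hedged sketch of what that NIC-side bookkeeping might look like (the timings, names, and notification mechanism here are my assumptions, not details from the paper):

```python
# Illustrative NIC-side path health tracking: detect a dead plane on an
# ACK timeout, tell the peer to avoid it, and retry it after a minute.
import time

class PathHealth:
    RETRY_AFTER_S = 60.0  # most failed links recover within a minute

    def __init__(self, num_paths: int) -> None:
        self.num_paths = num_paths
        self.down_since: dict[int, float] = {}

    def on_ack_timeout(self, path: int, notify_peer) -> None:
        """Mark a plane as failed and ask the peer to stop sending on it."""
        if path not in self.down_since:
            self.down_since[path] = time.monotonic()
            notify_peer(path)

    def usable_paths(self) -> list[int]:
        """Healthy planes; failed ones return to service after the window."""
        now = time.monotonic()
        for p, t in list(self.down_since.items()):
            if now - t >= self.RETRY_AFTER_S:
                del self.down_since[p]  # plane recovered: back in service
        return [p for p in range(self.num_paths) if p not in self.down_since]

health = PathHealth(num_paths=8)
health.on_ack_timeout(3, notify_peer=lambda p: print(f"peer: avoid plane {p}"))
print(health.usable_paths())  # [0, 1, 2, 4, 5, 6, 7] until plane 3 recovers
```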
3. Multi-Plane Networks with Fewer Switch Tiers and Lower Cost
This is where MRC fundamentally changes the structure of the cluster. Instead of treating each network interface as a single 800Gb/s link, MRC splits it into many smaller links across independent network planes. For example, one interface can connect to eight different switches. A switch that can serve 64 ports at 800Gb/s can instead serve 512 ports at 100Gb/s. This makes it possible to build a fully connected network of about 131,000 GPUs with only two switching tiers, where a typical 800Gb/s network would require three or four.
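The 131,000 figure follows from standard two-tier Clos arithmetic; here is a quick back-of-the-envelope check using only the numbers in the paragraph above:

```python
# In a non-blocking two-tier Clos fabric of radix-k switches, each leaf
# devotes half its ports to hosts and half to spines, for k*k/2 host ports.
def max_hosts_two_tier(radix: int) -> int:
    return radix * radix // 2

print(max_hosts_two_tier(64))   # 2,048 hosts if ports stay at 800Gb/s
print(max_hosts_two_tier(512))  # 131,072 hosts at 100Gb/s -- the ~131k claim
```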
The savings compound: the research team estimates that, for the same full bandwidth, a two-tier multi-plane design requires 2/3 the optics and 3/5 the number of switches of a three-tier network. Fewer switch tiers also mean lower latency – the longest path traverses only three switches rather than five or seven – and a smaller blast radius when any one component fails.
Hardware: Which NICs and Switches Use MRC
According to the research paper, MRC is already running in production on commercially available hardware. It runs on 400 and 800Gb/s RDMA NICs – including NVIDIA ConnectX-8, AMD Pollara, AMD Vulcano, and Broadcom Thor Ultra – with SRv6 switch support on NVIDIA Spectrum-4 and Spectrum-5 (running Cumulus and SONiC) and on Broadcom Tomahawk 5 (running EOS). On the congestion-control side, AMD has contributed its NSCC congestion-control algorithm, now part of the UEC Congestion Control specification, along with semantic-layer extensions for IB/RDMA transport that let MRC integrate with existing RDMA programming models while adding the multipath capabilities that set it apart from traditional transports.
Already in Production: From Stargate to Fairwater
MRC is not just a prototype. It is already deployed on all of OpenAI's NVIDIA GB200 supercomputers used to train frontier models, including the Oracle Cloud Infrastructure (OCI) Stargate site in Abilene, Texas, and Microsoft's Fairwater supercomputers in Atlanta and Wisconsin. MRC has been used to train multiple OpenAI models, running on hardware from both NVIDIA and Broadcom.
MRC has been used to train the large language models behind ChatGPT and Codex. During the training of OpenAI's latest frontier model, four tier-1 switches had to be restarted. With MRC, the company did not need to coordinate those restarts with the teams running training jobs on the cluster.
Key Takeaways
- OpenAI Introduces MRC – OpenAI has partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to release MRC (Multipath Reliable Connection) through the Open Compute Project (OCP).
- Packet Spraying Kills Congestion – MRC sprays packets across hundreds of paths simultaneously, eliminating core congestion and reducing tail latency in large GPU training runs.
- Microsecond Failure Recovery – MRC detects link and switch failures and redirects traffic in microseconds, keeping training jobs alive through failures that would previously have killed them outright.
- Two-Tier Topology for 131,000+ GPUs – By splitting 800Gb/s links into eight 100Gb/s planes, MRC supports supercomputers of roughly 131,000 GPUs with two switching tiers instead of three or four.
- Already Used for ChatGPT and Codex – MRC is deployed on all of OpenAI's NVIDIA GB200 supercomputers and has been used to train the large language models behind ChatGPT and Codex.
Check out the paper for the full technical details.
Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.



