Breaking the Host Memory Bottleneck: How Peer Direct Transformed Gaudi's Cloud Performance

When we launched Gaudi accelerators on Amazon's EC2 DL1 instances, we faced a challenge that threatened the entire deployment. The performance numbers weren't just disappointing; they were a disaster. Models that should have trained successfully saw their throughput drop by up to 50% in multi-node runs. The problem? A network topology that routed every byte through host memory, creating a bottleneck that undermined everything Gaudi was designed to do.
I led an engineering effort to address this issue, which eventually produced what we now call Peer Direct. It's a feature that transformed how Gaudi accelerators communicate in cloud environments, and its story holds some useful lessons for distributed AI training at scale.
The Problem with Host NICs
Gaudi was designed with NICs (Network Interface Cards) embedded directly in silicon. Each chip has ten ports of 100 Gbps each and supports RDMA over RoCE v2, which lets devices read and write each other's memory directly without involving the host CPU. This architecture is a natural fit for AI training workloads, where collectives such as AllReduce must aggregate gradients across tens or hundreds of devices on every training iteration.
But cloud deployments don't always come with ideal infrastructure. When Amazon built the DL1 instances, they used standard host NICs instead of Gaudi's built-in networking. The reasons were sound: it saved cost and spared the existing data center infrastructure from having to accommodate a new network topology. From a business perspective, reusing an established network fabric makes perfect sense.
From a performance standpoint, it was a disaster. Instead of peer-to-peer RDMA transfers between Gaudi cards, every communication took the long way around. Data had to be copied from Gaudi's high-bandwidth memory into host DRAM, handled by the host CPU, sent through the host NIC over TCP/IP, received by the remote host, and copied into the remote Gaudi's memory. Every extra hop added latency, stole CPU cycles, and imposed bandwidth constraints that crippled distributed training throughput.
The performance loss was bad enough to call into question whether the deployment would be worth it at all. This was not a matter of incremental tuning; it was an existential threat to the entire engagement with AWS.
Why Performance Is So Important
It's worth understanding why a 50% performance loss is catastrophic when training models, especially large models at GPT scale. Training a large language model takes weeks or even months, even on massive clusters. When you're iterating on models with billions of parameters, each percentage point of throughput translates directly into time and dollars.
Think about the economics. If training a model takes 30 days instead of 15, you're not just waiting longer; you pay for double the compute time. At cloud scale, with hundreds or thousands of accelerators running continuously, that adds up to millions of dollars. Worse, it halves your iteration speed. In the competitive world of AI, where companies race to build better models, running twice as many experiments in the same time can be the difference between leading and trailing.
Environmental costs matter too. Larger models require more electricity to train. Halving compute time roughly halves energy consumption and the carbon emissions that come with it. As pressure mounts on the AI industry to cut its carbon footprint, efficiency is no longer a luxury but a necessity.
The Peer Direct Solution
The solution we designed, Peer Direct, delivers RDMA-like functionality in environments where the physical network is unsuitable for traditional RDMA. We needed direct memory access between Gaudi devices on different systems without traversing host memory, over host NICs that were never designed for this in the first place.
The foundation was the AWS Elastic Fabric Adapter (EFA), a high-performance network interface for HPC and AI workloads on EC2. EFA provides low-latency, OS-bypass communication, typically in the sub-10 microsecond range, and exposes RDMA-like semantics through libfabric, a user-space communication library that offers a common interface across many network technologies.
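To make this concrete, here is roughly what selecting the EFA provider through libfabric looks like. This is a minimal sketch rather than production code; the provider name "efa" and the capability bits follow the public libfabric API, and error handling is trimmed to the essentials:

```c
/* Minimal sketch: discover the EFA provider via libfabric's fi_getinfo().
 * Not the actual HCCL/Peer Direct code; error handling is trimmed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->fabric_attr->prov_name = strdup("efa"); /* ask for EFA */
    hints->caps = FI_MSG | FI_RMA | FI_HMEM;       /* RDMA + device memory */
    hints->ep_attr->type = FI_EP_RDM;              /* reliable datagram EP */

    int ret = fi_getinfo(FI_VERSION(1, 14), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return EXIT_FAILURE;
    }
    printf("provider: %s, fabric: %s\n",
           info->fabric_attr->prov_name, info->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return EXIT_SUCCESS;
}
```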
The task was to integrate libfabric with the Habana Collective Communication Library, HCCL, which handles all distributed training communication. HCCL was built around native RDMA over Gaudi's on-chip NICs. We needed a bridge that would let HCCL use libfabric transparently, without compromising its performance guarantees or communication semantics.
The solution required several pieces of new engineering. First, we introduced a memory registration scheme that allows libfabric to access Gaudi's high-bandwidth memory directly. We used the Linux kernel's DMA-BUF framework, which provides a standard mechanism for sharing buffers between device drivers. When HCCL needs to transfer data, the Gaudi driver exports a DMA-BUF file descriptor for the memory region, which libfabric can use to set up RDMA transfers directly from device memory.
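As a rough illustration, here is what such a registration might look like in libfabric terms. This is a sketch, not the actual HCCL code; it assumes a recent libfabric build with DMA-BUF registration support (the FI_MR_DMABUF flag) and the FI_HMEM_SYNAPSEAI interface for Habana device memory, with the dmabuf fd already exported by the driver:

```c
/* Sketch of DMA-BUF-based registration. Assumes a libfabric build with
 * FI_MR_DMABUF support and the FI_HMEM_SYNAPSEAI interface for Habana
 * device memory; `domain` and `dmabuf_fd` are placeholders, not the
 * actual HCCL symbols. */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int register_gaudi_region(struct fid_domain *domain,
                                 int dmabuf_fd, size_t len,
                                 struct fid_mr **mr_out)
{
    struct fi_mr_dmabuf dmabuf = {
        .fd = dmabuf_fd,          /* fd exported by the Gaudi driver */
        .offset = 0,
        .len = len,
        .base_addr = NULL,
    };
    struct fi_mr_attr attr = {
        .dmabuf = &dmabuf,
        .iov_count = 1,
        .access = FI_READ | FI_WRITE | FI_REMOTE_READ | FI_REMOTE_WRITE,
        .iface = FI_HMEM_SYNAPSEAI,  /* Habana device memory */
    };
    /* FI_MR_DMABUF tells libfabric to read attr.dmabuf, not attr.mr_iov. */
    return fi_mr_regattr(domain, &attr, FI_MR_DMABUF, mr_out);
}
```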
Second, we added an LRU cache for memory registrations. Registration is expensive: it involves kernel calls and setup work that add significant overhead. By caching the mapping from memory regions to their libfabric registration handles, we could reuse registrations for frequently accessed buffers, removing most of the registration overhead from the actual training loop.
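The idea can be illustrated with a toy cache keyed by (address, length) with LRU eviction. The real cache also has to handle invalidation on free and reference counting; this sketch only shows the reuse mechanism, and all names are illustrative:

```c
/* Toy sketch of a registration cache with LRU eviction, keyed by
 * (addr, len). Illustrative only; the production cache is more involved. */
#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

#define MR_CACHE_SLOTS 64

struct mr_entry {
    uint64_t addr, len;
    struct fid_mr *mr;     /* handle from fi_mr_regattr() */
    uint64_t last_used;    /* logical clock for LRU ordering */
};

static struct mr_entry cache[MR_CACHE_SLOTS];
static uint64_t clock_now;

struct fid_mr *mr_cache_get(uint64_t addr, uint64_t len)
{
    for (int i = 0; i < MR_CACHE_SLOTS; i++)
        if (cache[i].mr && cache[i].addr <= addr &&
            addr + len <= cache[i].addr + cache[i].len) {
            cache[i].last_used = ++clock_now;   /* hit: refresh LRU */
            return cache[i].mr;
        }
    return NULL;  /* miss: caller registers, then calls mr_cache_put() */
}

void mr_cache_put(uint64_t addr, uint64_t len, struct fid_mr *mr)
{
    int victim = 0;
    for (int i = 0; i < MR_CACHE_SLOTS; i++) {  /* find LRU slot */
        if (!cache[i].mr) { victim = i; break; }
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;
    }
    if (cache[victim].mr)
        fi_close(&cache[victim].mr->fid);       /* deregister evictee */
    cache[victim] = (struct mr_entry){ addr, len, mr, ++clock_now };
}
```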
The result was a communication pipeline that looks like this: HCCL calls into the OFI wrapper, which uses a cached libfabric registration handle to issue an RDMA transfer directly from source Gaudi memory to destination Gaudi memory, without the CPU ever touching the data. We introduced the OFI wrapper to keep the codebase clean: it's a lightweight library that HCCL loads dynamically, allowing the use of libfabric without direct header integration. Once the transfer completes, libfabric posts an entry to a completion queue, and HCCL proceeds to compute on the newly received data.
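In libfabric terms, the hot path boils down to posting a one-sided write and polling a completion queue. A minimal sketch, assuming connection setup has already produced the endpoint, CQ, peer address, and remote key, and that `desc` comes from fi_mr_desc() on the cached registration:

```c
/* Sketch of the hot path: a one-sided RDMA write from local Gaudi memory
 * to remote Gaudi memory, then polling the completion queue. `ep`, `cq`,
 * the peer address, and the remote key are assumed to come from setup. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int rdma_write_and_wait(struct fid_ep *ep, struct fid_cq *cq,
                               const void *src, size_t len, void *desc,
                               fi_addr_t peer, uint64_t remote_addr,
                               uint64_t remote_key)
{
    /* Post the write; the data path is device-to-device, no CPU copies. */
    ssize_t ret = fi_write(ep, src, len, desc, peer,
                           remote_addr, remote_key, /*context=*/NULL);
    if (ret)
        return (int)ret;

    /* Poll the CQ until the transfer's completion entry shows up. */
    struct fi_cq_entry entry;
    do {
        ret = fi_cq_read(cq, &entry, 1);
    } while (ret == -FI_EAGAIN);

    return ret < 0 ? (int)ret : 0;
}
```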
The Development Experience
Building Peer Direct meant entering new territory on a tight schedule. Libfabric was not yet established in the AI accelerator world. Public documentation was scarce and community discussion sparse, so much of the work came down to reading the libfabric source code and reverse engineering behavior through experiments.
Communication with the AWS engineers was essential, but time zones worked against us. Collaborating with a team twelve hours away meant each debug iteration had a 24-hour turnaround. Every issue required careful documentation and precise communication, since real-time collaboration was impossible.
The stakes were high: the entire DL1 launch depended on this work, and a slip would have delayed a major product release. No one on our team had deep knowledge of libfabric's internals, so we were learning a complex codebase while designing critical integrations at the same time.
Results
When we deployed Peer Direct, the speedup was worth all the effort. We saw a 1.5 to 2x throughput increase in collective performance at 32MB message sizes, and the improvement held for larger messages, reaching up to 1.76x for 256MB messages. The CPU bottleneck created by host-memory staging disappeared completely.
Most importantly, these microbenchmark improvements translated directly into real training performance. Training Habana's DeepSpeed BERT model with 5 billion parameters across 128 Gaudi devices, we saw substantial end-to-end gains. Models that use aggressive memory optimization techniques such as ZeRO-2, which lean heavily on collective communication, benefited just as much from Peer Direct.
Peer Direct became one of the main enablers of Gaudi on AWS DL1 instances, allowing performant distributed training to work on launch day. Beyond that initial impact, the effort laid the groundwork for future high-performance networking features and proved that AI accelerators can stay competitive despite cloud infrastructure constraints.
The experience reinforced an important lesson in systems engineering: the most significant performance improvements often come not from making the existing path faster, but from eliminating unnecessary detours entirely. In distributed AI training, a direct data path between accelerators, with no redundant copies and no CPU intervention, is what makes scaling possible.
Takeaways
One important takeaway from this project is that assumptions about network topology should be tested very early in a distributed training system. Most accelerator stacks are built for an idealized environment, ignoring the extra hops, translation layers, and cost-driven compromises that exist in cloud deployments. So before optimizing at the model or kernel level, developers should run simple collective microbenchmarks across the target topology, as the sketch below illustrates. If scaling efficiency drops sharply as node count or message size grows, the culprit is likely the data path, not the kernels. Identifying a host-memory detour early lets developers focus their effort where it will have the greatest impact.
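One way to make such a benchmark comparable across cluster sizes is to convert measured AllReduce time into "bus bandwidth" using the standard ring-AllReduce traffic formula, where each device sends and receives 2(n-1)/n times the message size. A small sketch with purely hypothetical numbers:

```c
/* Illustrative helper (not from the project): convert a measured
 * AllReduce time into per-device "bus bandwidth" so runs at different
 * node counts are comparable. Ring AllReduce moves 2*(n-1)/n * S bytes
 * per device for a message of S bytes across n devices. */
#include <stdio.h>

static double allreduce_bus_bw_gbps(double msg_bytes, int n_devices,
                                    double seconds)
{
    double traffic = 2.0 * (n_devices - 1) / n_devices * msg_bytes;
    return traffic / seconds / 1e9;  /* GB/s per device */
}

int main(void)
{
    /* Hypothetical run: 256MB AllReduce over 128 devices in 50 ms. */
    printf("bus bw: %.2f GB/s\n",
           allreduce_bus_bw_gbps(256e6, 128, 0.050));
    return 0;
}
```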
Another important lesson was the need to treat memory registration, not just data transfer, as a first-class performance concern. Registration overhead can easily exceed the time spent on the wire if every transfer requires a fresh registration. The LRU cache of registered regions was an unglamorous addition to HCCL, but it eliminated a systematic source of latency and made the RDMA path viable for real workloads. When building distributed systems, developers must budget not only for network bandwidth but also for the lifecycle costs of allocating, registering, and deregistering buffers. Small changes in these control paths can yield large end-to-end throughput gains.
Finally, the approach used in this project offers a reusable integration pattern. Instead of rewriting HCCL to use libfabric directly, we created a thin abstraction layer that preserves the existing semantics while swapping the underlying transport. This reduced risk, limited code churn, and allowed incremental testing. Teams facing a similar challenge (i.e., adapting native communication libraries to cloud fabrics) should aim to isolate the transport layer, preserve collective semantics, and build small, testable seams between the two; the sketch below shows the shape of such a seam. This not only speeds development but also makes it easier to support future transport backends.
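The general shape of that seam, reduced to a sketch (the names are illustrative, not the real HCCL or OFI wrapper symbols): the core library compiles against a small function table and loads a concrete backend at runtime, so it never includes transport headers directly.

```c
/* Sketch of the thin-wrapper pattern: the core library talks to a small
 * transport vtable and dlopen()s a concrete backend (e.g. a libfabric
 * shim) at runtime. All names here are illustrative placeholders. */
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

/* Transport interface the core library compiles against. */
struct transport_ops {
    int (*connect_peer)(int rank);
    int (*write_remote)(int rank, const void *src, size_t len,
                        uint64_t remote_addr, uint64_t remote_key);
    int (*poll_completion)(void);
};

/* Each backend shared object exports this symbol. */
typedef const struct transport_ops *(*get_ops_fn)(void);

const struct transport_ops *load_transport(const char *so_path)
{
    void *handle = dlopen(so_path, RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return NULL;
    get_ops_fn get_ops = (get_ops_fn)dlsym(handle, "transport_get_ops");
    return get_ops ? get_ops() : NULL;
}
```

Because the core only sees the vtable, a new backend can be tested in isolation and swapped in without touching the collective logic.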
Disclosure: I work as an AI Runtime team manager at Intel. The opinions shared in this article are my own.