Why Decade-Old Residual Connections Still Power All of AI (And Why That’s a Problem)



1. Introduction

Over the past decade, deep learning has grown enormously, both in the compute capacity of the hardware and in the ingenuity of the architectures that use it. Look a little closer, though, and a few key parts of the underlying design have barely changed. We’ve seen a massive shift from convolutional networks to the Transformer architectures that power today’s large language models, but the way these networks route information from one layer to the next is essentially the same.

Recently, researchers at DeepSeek-AI released a paper titled “mHC: Manifold-Constrained Hyper-Connections” (Xie et al., 2025)¹, which proposes a redesign of this routing system. To really appreciate the solution they came up with, let’s look at how signal propagation has evolved over the past few generations of models, and why the current methods are hitting a wall.

2. The Backbone: Standard Residual Connections

First, to understand the specific problem the authors are trying to solve, we need to go back to where it all started: the standard residual connection (He et al., 2016)². Introduced in 2015 with ResNets, the residual connection is arguably one of the most important architectural design choices in deep learning, and it appears in virtually every large model in use today.

(Source: Author)
Visual depiction of the Residual Connection

Mathematically, it looks like this:

x_{l+1} = x_l + F(x_l)

(Source: Author)
x_{l+1}: Final output activation of the layer
x_l: Input activations to the layer
F(.): Transformations applied by the layer

It simply means that the final output of a layer is the sum of the layer’s transformation F(x_l) and the input it originally received. The key component here is that bare x_l term in the residual stream, which we call the identity mapping. It’s important because it acts as an uninterrupted pathway for the gradient signal to flow through the entire network from start to finish. This property is what largely keeps gradients from vanishing or exploding during training, and it allows us to successfully train models with hundreds of layers while still ensuring each layer learns and updates itself effectively.
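To make the role of that bare x_l term concrete, here is a minimal sketch in plain Python. It treats each layer as a scalar function and multiplies the per-layer derivatives together, which is what the chain rule does during backpropagation. The local slope of 0.01 and the depth of 50 are hypothetical values chosen purely for illustration, and the function name `backprop_gain` is mine, not something from the ResNet paper.

```python
def backprop_gain(local_slopes, use_residual):
    """Multiply per-layer derivatives, as the chain rule does in backprop.

    A plain layer y = F(x) contributes F'(x); a residual layer
    y = x + F(x) contributes 1 + F'(x), the "1" being the identity path.
    """
    gain = 1.0
    for slope in local_slopes:
        gain *= (1.0 + slope) if use_residual else slope
    return gain


# Hypothetical setup: 50 layers, each with a tiny local slope of 0.01.
slopes = [0.01] * 50
plain_gain = backprop_gain(slopes, use_residual=False)
residual_gain = backprop_gain(slopes, use_residual=True)

print(f"plain stack:    {plain_gain:.3e}")
print(f"residual stack: {residual_gain:.3f}")
```

With these made-up numbers, the gradient through the plain stack is on the order of 1e-100, while the residual stack keeps it at roughly 1.6. Real layers are matrices rather than scalars, but the identity term plays the same role.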

2.1 The Problem with Standard Residual Connections

But as models have grown increasingly massive, we’ve started to hit the limits of this straightforward approach.

In a standard Transformer model, we can imagine the residual stream as having a fixed width, which we can refer to as dimension C. Every piece of context, memory, and feature representation has to be crammed into this single C-dimensional vector as it moves up the network. As deeper layers make the information more abstract and expressive, this fixed-width x_l term becomes an information bottleneck.

Typically, if you want to increase the representational capacity of the model, you have to widen the computational layers or add more of them. But doing that also massively increases the compute required to train and run the model.
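A rough back-of-the-envelope calculation shows why widening is expensive. The sketch below assumes a textbook dense Transformer block (four C×C attention projections plus a two-matrix MLP with a 4× hidden size) and ignores biases, norms, and embeddings. That is an assumption for illustration only; it is not the DeepSeek-V3 layout, and the widths 4096 and 8192 are hypothetical.

```python
def dense_params_per_block(c, mlp_ratio=4):
    """Rough weight count for one dense Transformer block of width c.

    Assumes 4 attention projections (Q, K, V, output) of size c x c and a
    two-matrix MLP with hidden size mlp_ratio * c. Biases, norms, and
    embeddings are ignored.
    """
    attention = 4 * c * c
    mlp = 2 * c * (mlp_ratio * c)
    return attention + mlp


base = dense_params_per_block(4096)
wider = dense_params_per_block(8192)
print(base, wider, wider / base)
```

Under these assumptions, doubling C quadruples the weights (and roughly the per-token compute) of every block, which is the cost that wider residual streams try to avoid.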

2.2 The Improvement: Hyper-Connections (HC)

To address this limitation, researchers at ByteDance introduced an alternative to the vanilla residual stream, known as Hyper-Connections (Zhu et al., 2024)³.

(Source: Author)
A visual diagram of information flow in unconstrained Hyper-Connections.

If the normal residual streams are just too “thin”, HC widens them. Instead of relying on a single stream of width C, the idea is to expand the width of the residual stream by a specific factor, let’s say n. So what you now end up with is a wider vector composed of n parallel streams, resulting in a total width of n×C.

But since the actual computational layers of the model, like the Attention and MLP blocks, still expect a standard input with only C dimensions, HC introduces a set of learnable weights to convert between the wide and narrow streams:

A Pre-Mapping Matrix: This reads from the wide stream and condenses it down to size C.
A Post-Mapping Matrix: This takes the layer’s narrow output and expands it back into the wide stream.
A Residual Mapping Matrix: This sits directly on the residual pathway, and its purpose is to mix the information across the n parallel streams as the signal moves forward.

Fundamentally, by doing this, HC increases the network’s capacity and makes the residual stream more expressive. The residual mapping matrix means the residual path no longer just carries the signal forward unchanged; it also lets the n parallel streams exchange information with each other. This allows the model to maintain a much richer internal representation across multiple streams, without increasing the compute cost of the main layers.
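The three mappings can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions: the weights are fixed numbers (the published methods may compute them dynamically from the input), the shapes are a length-n read vector, a length-n write vector, and an n×n mixing matrix, and `toy_layer` is a stand-in for a real Attention or MLP block. All names here are hypothetical.

```python
def hyper_connection_step(streams, h_pre, h_post, h_res, layer_fn):
    """One layer update over n parallel residual streams of width C.

    streams: list of n lists, each of length C (the wide residual state)
    h_pre:   n weights that read the wide state down to one C-vector
    h_post:  n weights that write the layer output back to each stream
    h_res:   n x n matrix that mixes the streams on the residual path
    """
    n, c = len(streams), len(streams[0])
    # Pre-mapping: weighted sum of the n streams -> one narrow C-vector.
    narrow = [sum(h_pre[i] * streams[i][k] for i in range(n)) for k in range(c)]
    # The expensive layer (attention or MLP) still sees only C dimensions.
    out = layer_fn(narrow)
    # Residual mapping mixes the streams; post-mapping adds the output back.
    return [
        [sum(h_res[i][j] * streams[j][k] for j in range(n)) + h_post[i] * out[k]
         for k in range(c)]
        for i in range(n)
    ]


def toy_layer(v):
    # Stand-in for attention/MLP; doubling is purely illustrative.
    return [2.0 * a for a in v]


# With n = 1 and all weights equal to 1, the update reduces to x + F(x).
single = hyper_connection_step([[1.0, 2.0]], [1.0], [1.0], [[1.0]], toy_layer)

# With n = 2, this residual matrix swaps the two streams.
wide = hyper_connection_step(
    [[1.0, 2.0], [3.0, 4.0]],
    h_pre=[0.5, 0.5], h_post=[1.0, 0.0],
    h_res=[[0.0, 1.0], [1.0, 0.0]],
    layer_fn=toy_layer,
)
print(single, wide)
```

The n = 1 case returning `[[3.0, 6.0]]` is a useful sanity check: the standard residual connection is the special case where every mapping is the identity.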

2.3 The Flaws in Hyper-Connections

In practice, however, while HC looks great on paper, it introduces a couple of fatal flaws when you try to scale it up to the size of current LLMs:

Mathematical Instability: That Residual Mapping matrix, although expressive, destroys the crucial identity mapping property. Because it can learn any value, it no longer perfectly conserves the original signal. A tiny feature scale-up in one layer compounds exponentially when multiplied across fifty layers. DeepSeek actually found that the signal could be amplified by a staggering factor of 3,000, causing wildly erratic gradients and massive spikes in the training loss.
The Hardware Bottleneck: Widening the stream by a factor of n forces the memory hardware to read and write significantly more data at every single step. Since memory access—not the actual computation—is often the biggest bottleneck in modern AI training, this extra overhead tanks training throughput and spikes the GPU memory footprint by a substantial margin.
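The compounding effect behind the first flaw is easy to reproduce with toy numbers. In the sketch below, the 2×2 residual mapping is hypothetical: each row sums to 1.15, meaning a 15% gain per layer, and the depth of 50 is likewise just for illustration. The gain is measured as the maximum absolute row sum of the composite matrix, which is my choice of metric here and not necessarily the one used in the paper.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_row_sum(m):
    """Worst-case amplification of a bounded input (the infinity norm)."""
    return max(sum(abs(v) for v in row) for row in m)


# Hypothetical unconstrained residual mapping: each row sums to 1.15,
# i.e. a 15% gain per layer. The values are made up for illustration.
h_res = [[1.05, 0.10],
         [0.10, 1.05]]

composite = [[1.0, 0.0], [0.0, 1.0]]  # start from the identity
for _ in range(50):
    composite = matmul(h_res, composite)

gain = max_abs_row_sum(composite)
print(f"gain after 50 layers: {gain:.1f}")
```

A modest 15% per-layer gain compounds to 1.15^50, which is roughly 1,084×. That puts the reported factor of 3,000 in perspective: it does not take a wildly misbehaving matrix to get there, just a slightly expansive one repeated many times.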

So, the researchers at DeepSeek were left with a very specific problem: how do you keep the expressive, wide streams of the HC paradigm, without destroying the mathematical stability of the network, and without saturating the GPU memory and I/O operations?

Let’s have a look at how they solved this.

3. The Solution: Manifold-Constrained Hyper-Connections (mHC)

To solve these two massive issues prevalent in HC, the DeepSeek team proposed a modified framework which they call Manifold-Constrained Hyper-Connections, or mHC.

The solution is broken down into two distinct parts. First, they had to fix the underlying math to stop the signal from exploding/vanishing. Second, they had to do some hardcore systems engineering to make sure the fix could actually run efficiently on modern GPUs. Let’s break down exactly how they did both of these.

3.1 Fixing the Math: The Birkhoff Polytope

The key mathematical insight was to take that problematic, unconstrained Residual Mapping matrix and force it to stay within a well-behaved set. To do that, they projected the matrix onto a specific mathematical space known as the Birkhoff polytope.

In simpler terms, they constrained the matrix so that it becomes a doubly stochastic matrix.

If you aren’t familiar with the term, a doubly stochastic matrix is a matrix where all the numbers are non-negative, and every row sums up to exactly 1, and every column also sums up to exactly 1.

(Source: Author)
Illustration of a doubly-stochastic matrix

By forcing the residual matrix into this specific format, the authors obtained a few highly beneficial mathematical properties:

Norm Preservation (No More Explosions): Mathematically, the spectral norm of a doubly stochastic matrix is exactly 1. This means that no matter what the matrix learns, it cannot amplify the signal or the gradient passing through it. And because every column sums to 1, the sum of the signal across the streams is conserved: the matrix redistributes the signal rather than scaling it up. This addresses the exploding signal problem and guards against the signal vanishing wholesale.
Compositional Closure (Deep Stability): If you multiply a doubly stochastic matrix by another doubly stochastic matrix, the result is still a doubly stochastic matrix. This ensures that the same guarantees hold even when you compound these matrices across fifty or a hundred layers.
Perfect Mixing: Geometrically, the Birkhoff polytope is the convex hull of the permutation matrices, so every doubly stochastic matrix is a weighted average of different ways of shuffling the streams. This means that it can mix the information across those n parallel streams without artificially amplifying the overall “energy” of the signal.
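These properties can be checked numerically on a small example. The sketch below builds two doubly stochastic matrices as weighted averages of permutation matrices (the Birkhoff-von Neumann view of the polytope), multiplies them across 50 hypothetical layers, and then applies the result to one value per stream. The weights, permutations, and input values are arbitrary choices of mine.

```python
def permutation_matrix(perm):
    """Matrix with a 1 at (i, perm[i]) in each row and 0 elsewhere."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def weighted_average(weights, matrices):
    """Convex combination of equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def row_sums(m):
    return [sum(row) for row in m]


def col_sums(m):
    return [sum(col) for col in zip(*m)]


perms = [permutation_matrix(p) for p in [(0, 1, 2), (1, 2, 0), (1, 0, 2)]]
a = weighted_average([0.5, 0.3, 0.2], perms)  # doubly stochastic
b = weighted_average([0.1, 0.6, 0.3], perms)  # doubly stochastic

# Compositional closure: alternate a and b across 50 "layers".
composite = permutation_matrix((0, 1, 2))  # identity
for layer in range(50):
    composite = matmul(a if layer % 2 == 0 else b, composite)

# Mixing without amplification: apply the composite to one value per stream.
x = [3.0, -1.0, 4.0]
y = [sum(composite[i][j] * x[j] for j in range(3)) for i in range(3)]

print(row_sums(composite), col_sums(composite))
print(sum(x), sum(y), max(abs(v) for v in y))
```

The row and column sums of the 50-layer product stay at 1 up to floating-point error, the total across streams is unchanged, and no output value exceeds the largest input in magnitude. Note what is and is not preserved: the streams are averaged together, so individual values do shrink toward the mean even though nothing is amplified.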

To actually turn a regular matrix into a doubly stochastic one during training, the researchers used something called the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967)⁵. During the forward pass, the algorithm first makes all the numbers in the matrix positive, and then iteratively rescales the rows and columns until they all sum to 1.
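The procedure is short enough to sketch directly. This is a plain-Python illustration, not DeepSeek’s fused kernel: I am assuming that exponentiation is used to make the entries positive and that each iteration normalizes rows and then columns, and the 3×3 input values are arbitrary. The default of 20 iterations matches the cap reported for the paper.

```python
import math


def sinkhorn_knopp(logits, iterations=20):
    """Approximately map a square matrix to doubly stochastic form."""
    # Step 1: make every entry strictly positive (assumed: exponentiation).
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iterations):
        # Step 2: rescale each row so it sums to 1.
        row_totals = [sum(row) for row in m]
        m = [[m[i][j] / row_totals[i] for j in range(n)] for i in range(n)]
        # Step 3: rescale each column so it sums to 1.
        col_totals = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
    return m


# Arbitrary unconstrained values standing in for a learned matrix.
raw = [[0.0, 1.0, -1.0],
       [0.5, -0.5, 0.2],
       [1.0, 0.3, -0.2]]

ds = sinkhorn_knopp(raw)
row_err = max(abs(sum(row) - 1.0) for row in ds)
col_err = max(abs(sum(col) - 1.0) for col in zip(*ds))
print(f"max row-sum error: {row_err:.2e}, max column-sum error: {col_err:.2e}")
```

Because the last step of each iteration normalizes the columns, the column sums are exact up to rounding, while the row sums are only approximately 1. How small that residual gets depends on the input matrix and on the number of iterations.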

3.2 Fixing the Hardware: Hardcore Systems Engineering

Solving the math on paper is great, but running all these wide streams and iterative Sinkhorn-Knopp calculations sounds like a nightmare for GPU memory. To get around this, the DeepSeek team implemented some aggressive infrastructure optimizations:

Kernel Fusion: Instead of running the mathematical operations one by one (which requires constantly reading and writing to the GPU’s memory), they used a framework called TileLang (Wang et al., 2025)⁶ to write custom, unified GPU kernels. This allowed them to fuse the matrix multiplications, the normalization, and the Sinkhorn-Knopp iterations into a single operation, bypassing the memory overhead.
Selective Recomputing: Expanding the residual stream means you normally have to save a massive amount of intermediate data for the backward pass during training. In practice, this would quickly cause the GPUs to run out of memory. To fix this, they throw the intermediate data away after the forward pass. They keep only the bare minimum inputs, and then quickly recompute the lightweight mHC operations using the unified kernels on the fly during the backward pass.
Overlapping Communication: In a distributed training system (multiple GPUs), because wider streams cause delays when communicating across GPUs, they had to change their scheduling system. By tweaking it, they hid the communication delays of the wide streams by running them simultaneously with the heavy computation of the attention layers, so that no part of the mHC is the rate-limiting step during training.
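Of the three optimizations, recomputation is the easiest to show in miniature. The sketch below is a generic, scalar toy version of activation recomputation, not DeepSeek’s implementation: the forward pass keeps only its inputs, and the backward pass recomputes the discarded intermediate before applying the chain rule. The function y = tanh(w · x) and all names are hypothetical, and nothing here models kernel fusion or communication overlap.

```python
import math


def forward(x, w):
    """Compute y = tanh(w * x) but keep only the inputs for later."""
    y = math.tanh(w * x)
    saved_for_backward = (x, w)  # the tanh activation itself is not stored
    return y, saved_for_backward


def backward(grad_y, saved_for_backward):
    """Recompute the discarded activation, then apply the chain rule."""
    x, w = saved_for_backward
    t = math.tanh(w * x)   # recomputed on the fly
    local = 1.0 - t * t    # derivative of tanh at w * x
    return grad_y * local * w, grad_y * local * x  # (dL/dx, dL/dw)


y, saved = forward(0.5, 2.0)
grad_x, grad_w = backward(1.0, saved)
print(y, grad_x, grad_w)
```

The trade is extra arithmetic in exchange for less stored data, which is attractive precisely when the recomputed operation is cheap and memory traffic is the bottleneck.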

Ultimately, all this systems engineering pays off. Despite the added math and wider streams, mHC adds only a 6.7% time overhead during training compared to a standard baseline model.

4. The Results: Did it Actually Work?

To see if all the math and systems engineering actually paid off, the DeepSeek team put mHC to the test. They trained several language models based on the DeepSeek-V3 architecture (Liu et al., 2024)⁴, scaling all the way up to a 27-billion parameter model. They compared their new mHC framework directly against a standard residual baseline and the unconstrained, unstable HC paradigm. Let’s take a look at how the experiments played out.

4.1 Restoring Training Stability

The main motivation behind mHC was to mitigate the erratic training behaviour that was observed in HC due to the unconstrained mapping matrices. As shown below, the standard HC model’s gradient norm (graph b) starts to destabilize with wild swings at around 12k steps, which is exactly the moment where we see the HC and mHC loss plots drift apart (graph a). Because of the smoother and more stable gradient norms with mHC, the model ultimately achieves a lower final training loss when compared to the vanilla HC.

(Source: Adapted from Xie et al., 2025, Figure 5)
Graph a: Plot of training loss vs training steps for three different variants of the same model. It demonstrates that the mHC-enabled model achieved the lowest training loss.
Graph b: Plot of Gradient norm vs training steps. It shows the highly unstable gradient norms that we get from the vanilla HC vs the smooth and predictable norms that we get from mHC.

4.2 Boosting Downstream Performance

A stable model is only useful if it’s actually smarter. To test this, the authors evaluated the 27B variant across multiple downstream benchmarks, including MATH, MMLU, and reasoning tasks like BBH and DROP. The mHC-enabled model showed consistent gains over the baseline across the board and surpassed the unconstrained HC on most benchmarks. The reasoning benchmarks saw a particularly nice gain in performance, suggesting that the wider residual streams are actively contributing to a more expressive model.

(Source: Adapted from Xie et al., 2025, Table 4)
mHC surpasses the baseline (normal residual connections) and unconstrained HC on every benchmark except MATH.

4.3 Predictable and Robust Scaling

An important test for any new deep-learning architectural paradigm is whether it obeys the established scaling laws. Some design choices that work for a 3B parameter model might fail or backfire for a 27B parameter model. To check this, the authors plotted the compute scaling curves for 3B, 9B, and 27B parameter models. The graphs below show that the relative loss improvement is maintained across all scales, supporting the claim that mHC is a scalable architectural upgrade.

(Source: Adapted from Xie et al., 2025, Figure 6 (a))
Left: Plot of Absolute Loss difference between mHC and Baseline vs training compute (in FLOPs).
Right: Plot of Relative Loss difference between mHC and Baseline vs training compute (in FLOPs).

4.4 Taming the Signal Explosion

As a final test, the authors checked one of their central claims directly: that the signal should not explode when compounded across many layers. For the standard unconstrained HC, we saw how the signal can be amplified by a factor of 3,000, which threw the gradients off completely during training. To see whether mHC fixed this, DeepSeek tracked the signal propagation dynamics layer by layer, and the results were as expected. Thanks to the doubly stochastic mapping matrices, the signal gain peaked at around 1.6 across the model, showing that the signal remained stable even after compounding across multiple layers.

(Source: Adapted from Xie et al., 2025, Figure 7)
Graph a: Plot of Signal Gain factor vs Layer (by index). The plot shows almost no gain at individual layers, showing that doubly-stochastic matrices successfully mitigate signal explosion.
Graph b: Plot of Signal Gain factor vs Layer (by index, compounded). The plot shows that when compounded, the signal gains a factor of about 1.6 at layer 20, which still remains healthy and bounded for training.

5. Counterfactuals: The “Gotchas” and Trade-offs

Before wrapping up, let’s discuss some of the flaws of mHC, as every engineering choice involves a trade-off. While mHC is a good alternative to the instability of Hyper-Connections, it does come with a few caveats that are worth mentioning.

The 6.7% Time Tax: DeepSeek proudly (and rightfully) notes that their infrastructure optimizations brought the training time overhead down to just 6.7% compared to a baseline model. That does sound incredibly low, but at the scale of training a massive LLM (hundreds of billions of parameters), where GPU compute costs run into the tens of millions of dollars, a 6.7% increase in training time translates to a very real, very large financial cost. You will be paying a premium for that extra representational capacity.
Massive Engineering Complexity: You cannot simply open up a standard PyTorch script, type in the implementation in a few lines of code, and expect to get these efficient results. To make mHC viable, the DeepSeek team had to write custom, low-level fused GPU kernels using TileLang, manually manage memory, and modify their pipeline scheduling. This significantly raises the barrier to entry. For smaller teams or researchers without dedicated infrastructure engineers, implementing mHC efficiently is going to be a massive undertaking.
The Math is an Approximation: On paper, the Sinkhorn-Knopp algorithm turns the residual mapping matrix into a perfect doubly stochastic matrix. However, to get a perfect result, the algorithm technically needs to run for an infinite number of iterations. To keep things fast, the researchers cap it at 20 iterations. Because of this approximation, the matrix isn’t mathematically perfect in practice. If it were perfect, we would observe a signal gain of exactly 1.0, but we don’t. The signal gain creeps up to a maximum of about 1.6 across the layers. It’s bounded and safe at this scale, but for even bigger models (some current LLMs exceed 500B parameters), this approximation might drift further from the ideal.
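To put the first caveat into numbers, here is a small worked calculation. Only the 6.7% figure comes from the paper; the baseline budgets are hypothetical, and the calculation assumes that cost scales linearly with training time.

```python
def overhead_cost(baseline_cost_usd, overhead_fraction=0.067):
    """Extra spend, assuming cost scales linearly with training time."""
    return baseline_cost_usd * overhead_fraction


# Hypothetical budgets; only the 6.7% figure comes from the paper.
for budget in (10_000_000, 50_000_000):
    print(f"${budget:,} baseline -> ${overhead_cost(budget):,.0f} extra")
```

Under those assumptions, a $10M training run would cost about $670,000 more, and a $50M run about $3.35M more.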

6. Conclusion: Final Thoughts and Adoptability

At the end of the day, the “mHC: Manifold-Constrained Hyper-Connections” paper is a substantial piece of research from DeepSeek. It highlights what it takes to actually push the boundaries of foundational models today: you need a deep understanding of pure mathematics to diagnose the theoretical flaws, and you need hardcore systems engineering to make the solution run on physical silicon and be practically viable.

The standard residual connection has been incredibly useful for the last decade, but as we push into the trillion-parameter era, we need pathways that can carry much richer, wider representations without affecting the stability of the network. DeepSeek has demonstrated one way of achieving wider and more representative pathways, and has innovated on an aspect of architecture previously thought to be unchanging.

As for adoptability, will we see mHC accepted and implemented rapidly? Probably not. Because of the heavy reliance on custom GPU kernels and complex pipeline scheduling, it has quite a steep barrier to entry, and it will likely take some time before it is abstracted away into an easy-to-use plug-and-play module for the wider community. However, DeepSeek has already shown that it works in their own experiments at up to 27B parameters.

Given the clear improvements in reasoning benchmarks and training stability, I fully expect well-resourced AI labs to start adopting and experimenting with mHC in their next generation of architectures. It’s a big step forward, and it shows that there is still plenty of room to innovate on the most fundamental building blocks of neural networks.

7. References

¹ Xie, Z., Wei, Y., Cao, H., et al. (2025). mHC: Manifold-Constrained Hyper-Connections. DeepSeek-AI. arXiv preprint arXiv:2512.24880.
² He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
³ Zhu, D., Huang, H., Huang, Z., et al. (2024). Hyper-connections. arXiv preprint arXiv:2409.19606. (The original ByteDance paper proposing unconstrained HC).
⁴ Liu, A., Feng, B., Xue, B., et al. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
⁵ Sinkhorn, R., & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343-348. (The foundational mathematics behind the matrix projection).
⁶ Wang, L., Cheng, Y., Shi, Y., et al. (2025). TileLang: A composable tiled programming model for AI systems. arXiv preprint arXiv:2504.17577. (The framework used for mHC’s custom GPU kernel fusion).
