Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules

nimda May 28, 2026

0 9 8 minutes read

Sakana AI Proposes DiffusionBlocks: a Block-wise Training Framework That Converts Residual Networks into Independently Trainable Denoising Modules

Researchers from Sakana AI and the University of Tokyo propose DiffusionBlocks. It trains transformer-based networks one block at a time. Training memory is reduced by a factor of B, where B is the number of blocks. Performance is maintained across diverse architectures.

The Memory Problem in Neural Network Training

End-to-end backpropagation requires storing intermediate activations across every layer. Memory consumption grows linearly with network depth. As models grow deeper, this becomes a significant training bottleneck.

One existing technique, activation checkpointing, reduces activation memory by recomputing activations on demand. However, it does not reduce memory for parameters, gradients, or optimizer states. With the Adam optimizer, each layer requires memory for parameters, gradients, and two optimizer states (momentum and variance). This totals 4 times the parameter size per layer, unchanged by activation checkpointing.

Block-wise training offers a different approach. Partitioning a network into B blocks and training each independently reduces memory to roughly 1/B. The reduction is proportional to the number of blocks. The challenge is defining a principled local objective for each block that still produces a globally coherent model.

Prior approaches like Hinton’s Forward-Forward algorithm and greedy layer-wise training rely on ad-hoc local objectives. They consistently underperform end-to-end training and are largely limited to classification tasks.

DiffusionBlocks addresses both the theoretical gap and the limited applicability of prior methods.

The Core Idea: Residual Connections as Euler Steps

The key insight builds on an established connection in the literature. Residual networks update each layer input via $zℓ = zℓ−1 + fθℓ (zℓ−1)$ . This corresponds to Euler discretization of ordinary differential equations.

The research team show these updates correspond specifically to the probability flow ODE in score-based diffusion models. In the Variance Exploding (VE) formulation, the reverse diffusion process follows:

$frac{mathrm{d}mathbf{z}_sigma}{mathrm{d}sigma} = -sigma nabla_{mathbf{z}} log p_sigma(mathbf{z}_sigma)$

Applying Euler discretization to this equation produces an update rule that structurally matches the residual connection update. A stack of residual blocks can be interpreted as discretized denoising steps. The steps span a noise level range [𝞂_min, 𝞂_max].

In score-based diffusion models, the score matching objective can be optimized independently at each noise level. This means each block can be trained independently, using only its own local objective. No inter-block communication is needed during training.

Converting a Network: Three Steps

Converting a standard residual network to DiffusionBlocks requires three modifications:

Block partitioning: Split the L-layer network into B blocks. Each block contains a contiguous group of layers.
Noise range assignment: Define a noise distribution p_noise and a noise range [𝞂_min, 𝞂_max]. Partition this range into B intervals and assign one interval to each block. The research team recommend a log-normal distribution for p_noise.
Noise conditioning: Extend each block’s input to include a noisy version of the target. Add noise-level conditioning via AdaLN (Adaptive Layer Normalization). Each block learns to predict the clean target from its noisy version within its assigned noise range.

During training, a single block is sampled per iteration. The other blocks are not computed. Memory consumption corresponds to L/B layers, not all L layers.

Equi-probability Partitioning

A naive uniform partition divides [𝞂_min, 𝞂_max] into equal intervals. This ignores the varying difficulty of denoising across noise levels. Intermediate noise levels contribute the most to generation quality under the log-normal training distribution.

DiffusionBlocks uses equi-probability partitioning instead. Boundaries are chosen so each block handles exactly 1/B of the total probability mass under p_noise. Blocks assigned to intermediate noise levels receive narrower intervals. Blocks handling extreme noise regions receive wider intervals.

In ablation studies on CIFAR-10 using DiT-S/2, block overlap was disabled to isolate each component. Equi-probability partitioning achieved FID of 38.03 versus 43.53 for uniform partitioning (lower is better). Both used a uniform layer distribution of [4,4,4] across 3 blocks.

Experimental Results

The research team evaluated DiffusionBlocks across five architectures spanning three task categories. All results compare DiffusionBlocks (trained block-wise) against the same architecture trained with end-to-end backpropagation.

Architecture	Dataset	Metric	Baseline	DiffusionBlocks	Memory Reduction
ViT, 12-layer, B=3	CIFAR-100	Accuracy (higher is better)	60.25%	59.30%	3x
DiT-S/2, 12-layer, B=3	CIFAR-10	FID test (lower is better)	39.83	37.20	3x
DiT-L/2, 24-layer, B=3	ImageNet 256×256	FID test (lower is better)	12.09	10.63	3x
MDM, 12-layer, B=3	text8	BPC (lower is better)	1.56	1.45	3x
AR Transformer, 12-layer, B=4	LM1B	MAUVE (higher is better)	0.50	0.71	4x
AR Transformer, 12-layer, B=4	OpenWebText	MAUVE (higher is better)	0.85	0.82	4x
Huginn recurrent-depth	LM1B	MAUVE (higher is better)	0.49	0.70	~10x compute

Forward-Forward comparison: On CIFAR-100, the Forward-Forward algorithm achieved only 7.85% accuracy under the same ViT architecture. This highlights the gap between ad-hoc contrastive objectives and the score matching objective used by DiffusionBlocks.

DiT inference efficiency: For diffusion models, each denoising step during inference activates only one block. A 12-layer DiT with B=3 uses only 4-layer evaluations per denoising step. This is a 3x inference compute reduction versus running all 12 layers.

Huginn training: Huginn applies the same 4-layer recurrent block recurrently. It uses stochastic recurrence depth averaging 32 iterations. Training uses 8-step truncated backpropagation through time (BPTT). DiffusionBlocks replaces this with a single forward pass per training step. The K-iteration inference procedure is kept unchanged. The 32x iteration reduction outweighs the 3x longer training schedule. DiffusionBlocks trains for 15 epochs versus Huginn’s 5 epochs. Total compute is reduced by approximately 10x.

OpenWebText results: On OpenWebText, DiffusionBlocks MAUVE was 0.82 versus 0.85. Generative perplexity under Llama-2 was 14.99 versus 15.05. Results on this dataset were mixed, with some metrics slightly worse than the baseline.

Masked diffusion partitioning: For masked diffusion models, block partitioning targets the masking schedule rather than continuous noise levels. Each block handles an equal decrement in the unmasking probability alpha

Comparison with NoProp

NoProp is a concurrent work that uses a diffusion framework for backpropagation-free training. It is evaluated only on classification tasks using a custom CNN-based architecture. It does not provide a procedure for applying the method to other architectures or tasks.

Method	Continuous-time	Block-wise	Accuracy on CIFAR-100
Backpropagation	No	No	47.80%
NoProp-DT	No	Yes	46.06%
NoProp-CT	Yes	No	21.31%
NoProp-FM	Yes	No	37.57%
DiffusionBlocks (ours)	Yes	Yes	46.88%

DiffusionBlocks is the only method combining a continuous-time formulation with block-wise training. It stays within 1 percentage point of the end-to-end backpropagation baseline.

Strengths and Weaknesses

Strengths:

Principled theoretical grounding via score matching, not ad-hoc local objectives
Works across five distinct architectures without task-specific modifications
B× training memory reduction, proportional to the number of blocks
For diffusion models, inference compute is also reduced by B× during generation
Equi-probability partitioning significantly outperforms uniform partitioning (FID 38.03 vs 43.53 on CIFAR-10)
Replaces K-iteration BPTT in recurrent-depth models with a single forward pass
Blocks can be trained in parallel across GPUs with zero communication overhead
Moderate block counts (B=2 or B=3) sometimes improve FID over end-to-end training

Weaknesses:

Requires matching input and output dimensions; cannot currently be applied to U-Net-style architectures
Validated only on models trained from scratch; fine-tuning of pretrained models is untested
No principled method for selecting optimal block count for a given architecture and task
Adds noise conditioning overhead: aggregated wall time is 0.0543s versus 0.0507s under standard training
On OpenWebText, some metrics are marginally worse than the autoregressive baseline

Marktechpost’s Visual Explainer

DiffusionBlocks · Sakana AI

ICLR 2026 · Block-wise Training

01 / 10

A Quick Guide

Sakana AI and the University of Tokyo propose DiffusionBlocks, a framework that partitions transformer-based networks into independently trainable blocks. Training memory is reduced by a factor of B, where B is the number of blocks.

Each block is trained independently via a score matching objective derived from continuous-time diffusion
Residual connections in transformers map to Euler steps of the reverse diffusion process
Validated on ViT, DiT, masked diffusion, autoregressive, and recurrent-depth transformers
For diffusion models, inference also activates only one block per denoising step

02 / 10

The Problem

Memory Grows Linearly With Network Depth

End-to-end backpropagation requires storing intermediate activations across every layer. As models grow deeper, memory consumption grows in step.

Activation checkpointing reduces activation memory by recomputing on demand. It does not reduce memory for parameters, gradients, or optimizer states.

With Adam, each layer needs memory for parameters, gradients, and two optimizer states (momentum and variance). This totals roughly 4x the parameter size per layer.

O(L)

Activation memory under end-to-end backprop

Per-layer memory for parameters, gradients, and optimizer states under Adam

O(L/B)

Memory footprint under DiffusionBlocks training

03 / 10

The Core Idea

Residual Connections as Euler Steps of Reverse Diffusion

Residual networks update each layer input via z_l = z_{l-1} + f_tl(z_{l-1}). This corresponds to Euler discretization of an ordinary differential equation.

The authors show these updates correspond specifically to the probability flow ODE in score-based diffusion models, under the Variance Exploding formulation.

dz_sigma / d_sigma = -sigma · grad_z log p_sigma(z_sigma)

A stack of residual blocks can therefore be interpreted as discretized denoising steps. The score matching objective can be optimized independently at each noise level, so each block trains alone.

04 / 10

Conversion Recipe

Three Modifications to Any Residual Network

Step 01

Block Partitioning

Split the L-layer network into B blocks. Each block contains a contiguous group of layers.

Step 02

Noise Range Assignment

Define a log-normal noise distribution and partition the range into B intervals. Assign one interval to each block.

Step 03

Noise Conditioning

Extend each block input with a noisy version of the target. Add noise-level conditioning via AdaLN.

During training, one block is sampled per iteration. Other blocks are not computed. Memory corresponds to L/B layers, not L.

05 / 10

Partitioning Strategy

Equi-Probability, Not Uniform, Intervals

A uniform partition divides the noise range into equal intervals. This ignores that intermediate noise levels contribute the most to generation quality.

DiffusionBlocks chooses boundaries so each block handles exactly 1/B of the total probability mass under the log-normal training distribution.

Partition Strategy	Layer Distribution	FID (CIFAR-10)
Uniform	[4, 4, 4]	43.53
Equi-Probability	[4, 4, 4]	38.03

Ablation on DiT-S/2 with block overlap disabled. Lower FID is better.

06 / 10

Experimental Results

Tested Across Five Architectures, Three Task Categories

Architecture	Dataset	Metric	Baseline	DiffusionBlocks	Memory
ViT, 12L, B=3	CIFAR-100	Accuracy ↑	60.25%	59.30%	3x
DiT-S/2, 12L, B=3	CIFAR-10	FID test ↓	39.83	37.20	3x
DiT-L/2, 24L, B=3	ImageNet 256	FID test ↓	12.09	10.63	3x
MDM, 12L, B=3	text8	BPC ↓	1.56	1.45	3x
AR Transformer, B=4	LM1B	MAUVE ↑	0.50	0.71	4x
AR Transformer, B=4	OpenWebText	MAUVE ↑	0.85	0.82	4x

07 / 10

Recurrent-Depth Models

Huginn: K-Iteration BPTT Becomes a Single Forward Pass

Huginn applies a 4-layer recurrent block with stochastic recurrence depth averaging 32 iterations during training. Standard training uses 8-step truncated backpropagation through time (BPTT).

Under DiffusionBlocks, training is a single forward pass per step. The K-iteration inference procedure is kept unchanged.

0.70

MAUVE on LM1B (vs 0.49 baseline)

16.08

Perplexity under Llama-2 (vs 17.04 baseline)

~10x

Less total training compute

08 / 10

Comparison with NoProp

The Only Continuous-Time, Block-Wise Method in the Comparison

Method	Continuous-Time	Block-Wise	CIFAR-100 Accuracy
Backpropagation	No	No	47.80%
NoProp-DT	No	Yes	46.06%
NoProp-CT	Yes	No	21.31%
NoProp-FM	Yes	No	37.57%
DiffusionBlocks	Yes	Yes	46.88%

Run on NoProp’s custom CNN architecture for a fair comparison.

09 / 10

Trade-offs

Strengths and Current Limitations

Strengths

Principled grounding via score matching, not ad-hoc local objectives
B× training memory reduction proportional to block count
Works across five distinct architectures unchanged
Inference cost also reduced B× for diffusion models
Replaces K-iteration BPTT in recurrent-depth models with a single forward pass
Blocks train in parallel with zero communication overhead

Limitations

Requires matching input and output dimensions, so cannot be applied to U-Net
Validated only on models trained from scratch, not via fine-tuning
No principled rule for selecting optimal block count
Adds noise conditioning overhead in wall time
On OpenWebText, some metrics are marginally lower than the baseline

10 / 10

Paper, Code, and Project Page

Published at ICLR 2026 by Makoto Shing, Masanori Koyama, and Takuya Akiba. Full implementation and experimental configurations are open.

01 / 10

Key Takeaways

DiffusionBlocks partitions residual networks into B independently trainable blocks, reducing training memory by a factor of B
Residual connections in transformers map to Euler steps of the reverse diffusion process, providing a principled local training objective for each block
Equi-probability partitioning assigns equal probability mass per block, not equal noise intervals, improving image generation FID significantly over uniform partitioning
Validated across five architectures: ViT, DiT, masked diffusion, autoregressive, and recurrent-depth transformers
For recurrent-depth models like Huginn, replaces K-iteration BPTT with a single forward pass, reducing total training compute by approximately 10x

Check out the Research Paper, Repo and Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

Source link

nimda May 28, 2026

0 9 8 minutes read