AI on multiple GPUs: ZeRO and FSDP

This post is part of a series about distributed AI training on multiple GPUs.
Introduction
In a previous post, we saw how Distributed Data Parallelism (DDP) speeds up training by splitting batches across GPUs. DDP solves the throughput problem, but introduces a new challenge: memory redundancy.
In vanilla DDP, every GPU holds a complete copy of the model parameters, gradients, and optimizer states. For large models like GPT-3 (175B parameters), this redundancy becomes a huge waste of precious VRAM.
ZeRO (Zero Redundancy Optimizer) solves this. There are three levels:
- ZeRO-1 partitions only the optimizer states
- ZeRO-2 partitions optimizer states + gradients
- ZeRO-3 partitions optimizer states + gradients + model parameters
ZeRO is not a model parallelism technique: all GPUs still run the same forward and backward passes. It is a memory optimization strategy that eliminates redundancy across GPUs, allowing you to train larger models on the same hardware.
Memory Problem in DDP
Let's break down what actually eats up memory during training. For a model with P parameters:
- Model parameters: P values (the weights of your neural network)
- Gradients: P values (one gradient per parameter)
- Optimizer states (Adam): 2P values (first and second moment for each parameter)
- Activations: intermediate results saved during the forward pass for reuse in the backward pass
The first three scale with model size and are fully replicated on every GPU in DDP. Activation memory scales with batch size, sequence length, layer widths, and so on, and differs per GPU since each GPU processes different data. ZeRO does not affect activation memory.
Let's calculate the memory usage of a 7B-parameter model trained with Adam in FP32:
- Parameters: 7 billion * 4 bytes = 28 GB
- Gradients: 7 billion * 4 bytes = 28 GB
- Optimizer states: 7 billion * 2 * 4 bytes = 56 GB
- Total memory per GPU in DDP: 112 GB
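These numbers are easy to sanity-check. The sketch below encodes the assumptions stated above: FP32 (4 bytes per value) and Adam's two moment buffers per parameter:

```python
def ddp_memory_gb(num_params, bytes_per_value=4, optimizer_values_per_param=2):
    """Per-GPU training memory in vanilla DDP (activations excluded)."""
    gb = 1e9  # the post uses decimal GB: 7e9 params * 4 bytes = 28 GB
    params = num_params * bytes_per_value / gb
    grads = num_params * bytes_per_value / gb
    optim = num_params * optimizer_values_per_param * bytes_per_value / gb
    return params, grads, optim, params + grads + optim

params, grads, optim, total = ddp_memory_gb(7e9)
# params=28.0, grads=28.0, optim=56.0, total=112.0
```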
Activations add significant memory on top of this, but since they differ per GPU, ZeRO cannot deduplicate them. Techniques like activation checkpointing can help by discarding activations and recomputing them as needed during the backward pass, but that is outside the scope of this article.
Let's understand how ZeRO works by building it from the ground up, starting with ZeRO-1 and working our way up to ZeRO-3.
ZeRO-1: Optimizer State Partitioning
In ZeRO-1, only the optimizer states are partitioned. Each GPU:
- still holds the full model parameters and gradients
- stores only 1/N of the optimizer states (N = number of GPUs)
- updates only its own 1/N slice of the parameters
These are the steps taken during training:
- Forward pass: each GPU processes its own micro-batch
- Backward pass: each GPU computes gradients for all parameters
- all-reduce gradients: every GPU receives the full averaged gradients
- Optimizer step: each GPU updates only its own parameter shard
- all-gather parameters: sync the updated parameters across all GPUs

Here is a simplified implementation:
import torch
import torch.distributed as dist

class ZeRO_1:
    def __init__(self, model, optimizer_cls):
        self.model = model
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.param_shards = []    # each rank holds only its shard of the optimizer states
        self.param_metadata = []  # metadata to reconstruct shards
        for param in self.model.parameters():
            original_shape = param.data.shape
            flat = param.data.view(-1)
            numel = flat.numel()
            remainder = numel % self.world_size
            pad_size = (self.world_size - remainder) % self.world_size
            padded_numel = numel + pad_size
            shard_size = padded_numel // self.world_size
            shard_start = self.rank * shard_size
            shard_end = shard_start + shard_size
            self.param_metadata.append(
                {
                    "original_shape": original_shape,
                    "numel": numel,
                    "padded_numel": padded_numel,
                    "shard_size": shard_size,
                    "shard_start": shard_start,
                    "shard_end": shard_end,
                }
            )
            if pad_size > 0:
                flat_padded = torch.cat([flat, flat.new_zeros(pad_size)])
            else:
                flat_padded = flat
            shard = flat_padded[shard_start:shard_end].clone()
            shard.requires_grad_(True)
            self.param_shards.append(shard)
        self.optimizer = optimizer_cls(self.param_shards)

    def training_step(self, inputs, targets, loss_fn):
        output = self.model(inputs)      # forward
        loss = loss_fn(output, targets)  # compute loss
        loss.backward()                  # backward
        self._sync_gradients()           # all-reduce gradients across GPUs
        self.optimizer.step()            # update local shard of parameters
        self._sync_params()              # all-gather model params
        # clear gradients for the next step
        for param in self.model.parameters():
            param.grad = None

    def _sync_gradients(self):
        for idx, param in enumerate(self.model.parameters()):
            meta = self.param_metadata[idx]
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= self.world_size
            flat_grad = param.grad.view(-1)
            # pad the gradient the same way as the parameter, so the
            # last rank's slice still has the full shard size
            if meta["padded_numel"] != meta["numel"]:
                flat_grad = torch.cat(
                    [flat_grad, flat_grad.new_zeros(meta["padded_numel"] - meta["numel"])]
                )
            self.param_shards[idx].grad = flat_grad[meta["shard_start"]:meta["shard_end"]]

    def _sync_params(self):
        for idx, param in enumerate(self.model.parameters()):
            meta = self.param_metadata[idx]
            full_flat = torch.empty(meta["padded_numel"], device=param.device, dtype=param.dtype)
            dist.all_gather_into_tensor(
                output_tensor=full_flat,
                input_tensor=self.param_shards[idx].data,
            )
            reconstructed = full_flat[:meta["numel"]].view(meta["original_shape"])
            param.data.copy_(reconstructed)
Note that the all-reduce synchronizes all gradients, but each GPU only uses the gradients belonging to its own parameter shard. That is wasteful. ZeRO-2 fixes this by sharding the gradients as well.
In practice, you will rarely use ZeRO-1, since ZeRO-2 gives better memory savings at the same communication cost. But it is still worth walking through for learning purposes.
Memory with ZeRO-1 (7B model, 8 GPUs):
- Parameters: 28 GB (fully replicated)
- Gradients: 28 GB (fully replicated)
- Optimizer states: 56 GB / 8 = 7 GB
- Total per GPU: 63 GB (down from 112 GB)
ZeRO-2: Gradient Partitioning
ZeRO-2 partitions both the optimizer states and the gradients. Since each GPU only updates a shard of the parameters, it only needs the corresponding gradients.
ZeRO-1 uses all-reduce, which gives every GPU all the gradients. ZeRO-2 replaces this with reduce-scatter: each GPU only receives the gradients it actually needs. This saves both memory and communication bandwidth.
Training steps:
- Forward pass: each GPU processes its own micro-batch
- Backward pass: each GPU computes gradients
- reduce-scatter gradients: each GPU receives only its own gradient shard
- Optimizer step: each GPU updates only its own parameter shard
- all-gather parameters: sync the updated parameters across all GPUs

The implementation is very similar to ZeRO-1; only the gradient synchronization step changes, using reduce-scatter instead of all-reduce.
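To see what that change buys, here is a single-process simulation of the two collectives. Ranks are emulated as list entries; in real code these would be `dist.all_reduce` and `dist.reduce_scatter_tensor`:

```python
def all_reduce_mean(grads_per_rank):
    """ZeRO-1 style: every rank ends up holding the full averaged gradient."""
    world_size = len(grads_per_rank)
    mean = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    return [list(mean) for _ in range(world_size)]

def reduce_scatter_mean(grads_per_rank):
    """ZeRO-2 style: rank r ends up holding only the r-th 1/N slice of the mean."""
    world_size = len(grads_per_rank)
    mean = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    shard_size = len(mean) // world_size
    return [mean[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]

# 4 simulated ranks, each with an 8-element locally computed gradient
grads = [[float(r)] * 8 for r in range(4)]
full_copies = all_reduce_mean(grads)  # each rank stores all 8 averaged values
shards = reduce_scatter_mean(grads)   # each rank stores only 8 / 4 = 2 values
```

The reduction result is identical; the only difference is how much of it each rank keeps, which is exactly the gradient memory saving.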
But wait, if every GPU calculates every gradient during backprop, how does this save VRAM? Here's how:
- Parameter gradients are computed layer by layer; as soon as a layer's gradients are ready, they can be reduce-scattered and the local full copy freed (our simplified implementation does not do this).
- During backprop, you only need the activation gradients from the next layer to compute the current layer's parameter gradients; you never need the full gradients of all layers at once.
- That way, you can free gradient memory as you move backward, keeping only the shard assigned to each GPU.
Memory with ZeRO-2 (7B model, 8 GPUs):
- Parameters: 28 GB (fully replicated)
- Gradients: 28 GB / 8 = 3.5 GB
- Optimizer states: 56 GB / 8 = 7 GB
- Total per GPU: 38.5 GB (down from 112 GB)
ZeRO-3: Parameter Partitioning
ZeRO-3 partitions optimizer states, gradients, and model parameters. Each GPU stores only 1/N of the model's entire state.
During the forward and backward passes, each layer needs its full parameters, but each GPU stores only a fraction of them. So we all-gather the parameters just in time, use them, and discard them immediately afterwards.
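The gather-use-discard pattern can be sketched as a plain loop; `gather_full_params` and `free_full_params` are hypothetical placeholders for the all-gather and reshard steps:

```python
def zero3_forward(layers, x, gather_full_params, free_full_params):
    """Conceptual ZeRO-3 forward: materialise one layer's weights at a time."""
    for layer in layers:
        gather_full_params(layer)  # all-gather this layer's full weights
        x = layer(x)               # compute with the full weights
        free_full_params(layer)    # drop them, keeping only the local shard
    return x

# toy demo with plain functions standing in for layers
events = []
layers = [lambda v: v + 1, lambda v: v * 2]
out = zero3_forward(
    layers, 3,
    gather_full_params=lambda l: events.append("gather"),
    free_full_params=lambda l: events.append("free"),
)
# out == 8, and at most one layer's weights are "full" at any time
```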
Training steps:
- Forward pass (per layer):
- All-gather the layer's parameters from all GPUs
- Run the layer's forward pass, using the previous layer's activations as input
- Discard the gathered parameters (keep only the local shard)
- Repeat until all layers are done
- Backward pass (per layer, in reverse order):
- All-gather the layer's parameters again
- Compute the layer's gradients using the activation gradients from the next layer
- Reduce-scatter the gradients (each GPU keeps only its own shard)
- Discard the gathered parameters (keep only the local shard)
- Repeat until all layers are done
- Each GPU runs an optimizer step on its own shard
- No final all-gather is needed, since parameters are gathered layer by layer during the next forward pass

Here is a simplified implementation:
class ZeRO_3(ZeRO_2):
    """
    ZeRO-3: shard optimizer states (stage 1) + gradients (stage 2) + model parameters (stage 3).
    At rest, each rank holds only param_shards[idx] — a 1/world_size slice
    of each parameter. Full parameters are materialised temporarily during
    the forward and backward passes via all_gather, then immediately freed.
    """

    def __init__(self, model, optimizer_cls):
        self.model = model
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.param_metadata = []
        shard_list = []
        self._param_to_idx = {}
        for idx, param in enumerate(self.model.parameters()):
            original_shape = param.data.shape
            flat = param.data.view(-1)
            numel = flat.numel()
            remainder = numel % self.world_size
            pad_size = (self.world_size - remainder) % self.world_size
            padded_numel = numel + pad_size
            shard_size = padded_numel // self.world_size
            shard_start = self.rank * shard_size
            shard_end = shard_start + shard_size
            self.param_metadata.append(
                {
                    "original_shape": original_shape,
                    "numel": numel,
                    "padded_numel": padded_numel,
                    "shard_size": shard_size,
                    "shard_start": shard_start,
                    "shard_end": shard_end,
                }
            )
            if pad_size > 0:
                flat_padded = torch.cat([flat, flat.new_zeros(pad_size)])
            else:
                flat_padded = flat
            shard = flat_padded[shard_start:shard_end].clone()
            shard_list.append(shard)
            # Replace the full tensor with only this rank's shard.
            # The model's param.data now points to a tiny slice; the full
            # weight will be reconstructed on demand during forward/backward.
            param.data = shard.detach()
            self._param_to_idx[param] = idx
        self.param_shards = [s.requires_grad_(True) for s in shard_list]
        self.optimizer = optimizer_cls(self.param_shards)
        self._register_hooks()

    def _gather_param(self, idx, device, dtype):
        """All-gather the full parameter tensor for parameter `idx`."""
        meta = self.param_metadata[idx]
        full_flat = torch.empty(meta["padded_numel"], device=device, dtype=dtype)
        dist.all_gather_into_tensor(
            output_tensor=full_flat,
            input_tensor=self.param_shards[idx].data,
        )
        return full_flat[: meta["numel"]].view(meta["original_shape"])

    def _gather_module_params(self, module):
        """Gather full params for every parameter that belongs to this module only (not children)."""
        for param in module.parameters(recurse=False):
            idx = self._param_to_idx[param]
            param.data = self._gather_param(idx, param.device, param.dtype)

    def _reshard_module_params(self, module):
        """Reshard params back to the local shard for every direct param of this module."""
        for param in module.parameters(recurse=False):
            idx = self._param_to_idx[param]
            param.data = self.param_shards[idx].data

    def _register_hooks(self):
        self._hooks = []
        for module in self.model.modules():
            # Skip container modules that have no direct parameters
            if not list(module.parameters(recurse=False)):
                continue
            # Forward: gather -> run -> reshard
            h1 = module.register_forward_pre_hook(
                lambda mod, _inputs: self._gather_module_params(mod)
            )
            h2 = module.register_forward_hook(
                lambda mod, _inputs, _output: self._reshard_module_params(mod)
            )
            # Backward: gather before grad computation -> reshard after
            h3 = module.register_full_backward_pre_hook(
                lambda mod, _grad_output: self._gather_module_params(mod)
            )
            h4 = module.register_full_backward_hook(
                lambda mod, _grad_input, _grad_output: self._reshard_module_params(mod)
            )
            self._hooks.extend([h1, h2, h3, h4])

    def training_step(self, inputs, targets, loss_fn):
        # Hooks handle all gather/reshard around each module automatically
        output = self.model(inputs)
        loss = loss_fn(output, targets)
        loss.backward()
        self._sync_gradients()
        # Each rank updates only its local shard
        self.optimizer.step()
        for param in self.model.parameters():
            param.grad = None
Each layer's parameters are gathered just before they are needed and freed immediately afterwards. This keeps memory overhead small at the cost of extra communication. In practice, real implementations overlap the all-gather of layer N+1 with the forward computation of layer N to hide the latency.
Memory with ZeRO-3 (7B model, 8 GPUs):
- Parameters: 28 GB / 8 = 3.5 GB
- Gradients: 28 GB / 8 = 3.5 GB
- Optimizer states: 56 GB / 8 = 7 GB
- Total per GPU: 14 GB (down from 112 GB)
That is an 8x reduction in memory usage, exactly what we would expect from sharding across 8 GPUs.
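The per-stage totals follow a simple pattern: each ZeRO stage divides one more of the three terms by the number of GPUs. A quick check, reusing the FP32 Adam assumptions from earlier:

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Per-GPU memory in GB (activations excluded) for DDP (stage 0) and ZeRO stages 1-3."""
    gb = 1e9
    params = num_params * 4 / gb      # FP32 weights
    grads = num_params * 4 / gb       # FP32 gradients
    optim = num_params * 2 * 4 / gb   # Adam: two FP32 moments per parameter
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3 also shards parameters
    return params + grads + optim

# 7B model on 8 GPUs:
# stage 0 -> 112.0, stage 1 -> 63.0, stage 2 -> 38.5, stage 3 -> 14.0
```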
Using ZeRO in PyTorch
PyTorch ships with two ZeRO-3 implementations: FSDP1 (older, legacy) and FSDP2 (newer, recommended). Always use FSDP2.
FSDP (Fully Sharded Data Parallel) handles parameter gathering, gradient reduce-scatter, communication overlap, and memory management automatically:
from torch.distributed.fsdp import fully_shard
model = Transformer()
for layer in model.layers:
fully_shard(layer)
fully_shard(model)
You apply fully_shard layer by layer and then wrap the entire model; sharding per layer lets FSDP gather and free one layer's parameters at a time instead of materialising the whole model at once.
Conclusion
ZeRO trades memory for communication, so it's not a free lunch. It is generally not worth it for small models (e.g., BERT), but it is a game changer for large models.
Congratulations on reaching the end! In this post, we covered:
- The memory replication problem in standard DDP
- How ZeRO partitions optimizer states, gradients, and parameters across GPUs
- The three stages of ZeRO and their memory/communication trade-offs
- How to use ZeRO-3 via FSDP in PyTorch
In the next article, we'll explore Tensor Parallelism, a model-parallelism technique that speeds up layer computations by splitting the work across GPUs.
References
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (original paper)
- PyTorch FSDP Tutorial
- FSDP API Reference
- Ultra-Scale Playbook by Hugging Face



