AI on multiple GPUs: ZeRO and FSDP

This post is part of a series about distributed AI training on multiple GPUs.
Introduction
In a previous post, we saw how Distributed Data Parallelism (DDP) speeds up training by splitting batches across GPUs. DDP solves the throughput problem, but introduces a new challenge: memory redundancy.
In vanilla DDP, every GPU holds a complete copy of the model parameters, gradients, and optimizer states. For large models like GPT-3 (175B parameters), this redundancy becomes a huge waste of precious VRAM.
ZeRO (Zero Redundancy Optimizer) solves this. There are three levels:
- ZeRO-1 partitions only the optimizer states
- ZeRO-2 partitions optimizer states + gradients
- ZeRO-3 partitions optimizer states + gradients + model parameters
ZeRO is not a model parallelism technique: all GPUs still run the same forward and backward passes. It is a memory optimization strategy that eliminates redundancy across GPUs, allowing you to train larger models on the same hardware.
Memory Problem in DDP
Let's break down what actually eats up memory during training. For a model with P parameters:
- Model parameters: P values (the weights of your neural network)
- Gradients: P values (one gradient per parameter)
- Optimizer states (Adam): 2P values (first and second moment for each parameter)
- Activations: intermediate results saved during the forward pass for reuse in the backward pass
The first three scale with model size and are fully replicated on every GPU in DDP. Activation memory scales with batch size, sequence length, layer widths, and so on, and differs per GPU since each GPU processes different data. ZeRO does not affect activation memory.
Let's calculate the memory usage of a 7B-parameter model trained with Adam in FP32:
- Parameters: 7 billion * 4 bytes = 28 GB
- Gradients: 7 billion * 4 bytes = 28 GB
- Optimizer states: 7 billion * 2 * 4 bytes = 56 GB
- Total memory per GPU in DDP: 112 GB
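These numbers are easy to sanity-check. The sketch below encodes the assumptions stated above: FP32 (4 bytes per value) and Adam's two moment buffers per parameter:

```python
def ddp_memory_gb(num_params, bytes_per_value=4, optimizer_values_per_param=2):
    """Per-GPU training memory in vanilla DDP (activations excluded)."""
    gb = 1e9  # the post uses decimal GB: 7e9 params * 4 bytes = 28 GB
    params = num_params * bytes_per_value / gb
    grads = num_params * bytes_per_value / gb
    optim = num_params * optimizer_values_per_param * bytes_per_value / gb
    return params, grads, optim, params + grads + optim

params, grads, optim, total = ddp_memory_gb(7e9)
# params=28.0, grads=28.0, optim=56.0, total=112.0
```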
Activations add significant memory on top of this, but since they differ per GPU, ZeRO cannot deduplicate them. Techniques like activation checkpointing can help by discarding activations and recomputing them as needed during the backward pass, but that is outside the scope of this article.
Let's understand how ZeRO works by building it from the ground up, starting with ZeRO-1 and working our way up to ZeRO-3.
ZeRO-1: Optimizer State Partitioning
In ZeRO-1, only the optimizer states are partitioned. Each GPU:
- still holds the full model parameters and gradients
- stores only 1/N of the optimizer states (N = number of GPUs)
- updates only its own 1/N slice of the parameters
These are the steps taken during training:
- Forward pass: each GPU processes its own micro-batch
- Backward pass: each GPU computes gradients for all parameters
- all-reduce gradients: every GPU receives the full averaged gradients
- Optimizer step: each GPU updates only its own parameter shard
- all-gather parameters: sync the updated parameters across all GPUs

Here is a simplified implementation:
import torch
import torch.distributed as dist

class ZeRO_1:
    def __init__(self, model, optimizer_cls):
        self.model = model
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.param_shards = []    # each rank holds only its shard of the optimizer states
        self.param_metadata = []  # metadata to reconstruct shards
        for param in self.model.parameters():
            original_shape = param.data.shape
            flat = param.data.view(-1)
            numel = flat.numel()
            remainder = numel % self.world_size
            pad_size = (self.world_size - remainder) % self.world_size
            padded_numel = numel + pad_size
            shard_size = padded_numel // self.world_size
            shard_start = self.rank * shard_size
            shard_end = shard_start + shard_size
            self.param_metadata.append(
                {
                    "original_shape": original_shape,
                    "numel": numel,
                    "padded_numel": padded_numel,
                    "shard_size": shard_size,
                    "shard_start": shard_start,
                    "shard_end": shard_end,
                }
            )
            if pad_size > 0:
                flat_padded = torch.cat([flat, flat.new_zeros(pad_size)])
            else:
                flat_padded = flat
            shard = flat_padded[shard_start:shard_end].clone()
            shard.requires_grad_(True)
            self.param_shards.append(shard)
        self.optimizer = optimizer_cls(self.param_shards)

    def training_step(self, inputs, targets, loss_fn):
        output = self.model(inputs)      # forward
        loss = loss_fn(output, targets)  # compute loss
        loss.backward()                  # backward
        self._sync_gradients()           # all-reduce gradients across GPUs
        self.optimizer.step()            # update local shard of parameters
        self._sync_params()              # all-gather model params
        # clear gradients for the next step
        for param in self.model.parameters():
            param.grad = None

    def _sync_gradients(self):
        for idx, param in enumerate(self.model.parameters()):
            meta = self.param_metadata[idx]
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= self.world_size
            flat_grad = param.grad.view(-1)
            # pad the gradient the same way as the parameter, so the
            # last rank's slice still has the full shard size
            if meta["padded_numel"] != meta["numel"]:
                flat_grad = torch.cat(
                    [flat_grad, flat_grad.new_zeros(meta["padded_numel"] - meta["numel"])]
                )
            self.param_shards[idx].grad = flat_grad[meta["shard_start"]:meta["shard_end"]]

    def _sync_params(self):
        for idx, param in enumerate(self.model.parameters()):
            meta = self.param_metadata[idx]
            full_flat = torch.empty(meta["padded_numel"], device=param.device, dtype=param.dtype)
            dist.all_gather_into_tensor(
                output_tensor=full_flat,
                input_tensor=self.param_shards[idx].data,
            )
            reconstructed = full_flat[:meta["numel"]].view(meta["original_shape"])
            param.data.copy_(reconstructed)
Note that the all-reduce synchronizes all gradients, but each GPU only uses the gradients belonging to its own parameter shard. That is wasteful. ZeRO-2 fixes this by sharding the gradients as well.
In practice, you will rarely use ZeRO-1, since ZeRO-2 gives better memory savings at the same communication cost. But it is still worth walking through for learning purposes.
Memory with ZeRO-1 (7B model, 8 GPUs):
- Parameters: 28 GB (fully replicated)
- Gradients: 28 GB (fully replicated)
- Optimizer states: 56 GB / 8 = 7 GB
- Total per GPU: 63 GB (down from 112 GB)
ZeRO-2: Gradient Partitioning
ZeRO-2 partitions both the optimizer states and the gradients. Since each GPU only updates a shard of the parameters, it only needs the corresponding gradients.
ZeRO-1 uses all-reduce, which gives every GPU all the gradients. ZeRO-2 replaces this with reduce-scatter: each GPU only receives the gradients it actually needs. This saves both memory and communication bandwidth.
Training steps:
- Forward pass: each GPU processes its own micro-batch
- Backward pass: each GPU computes gradients
- reduce-scatter gradients: each GPU receives only its own gradient shard
- Optimizer step: each GPU updates only its own parameter shard
- all-gather parameters: sync the updated parameters across all GPUs

The implementation is very similar to ZeRO-1; only the gradient synchronization step changes, using reduce-scatter instead of all-reduce.
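To see what that change buys, here is a single-process simulation of the two collectives. Ranks are emulated as list entries; in real code these would be `dist.all_reduce` and `dist.reduce_scatter_tensor`:

```python
def all_reduce_mean(grads_per_rank):
    """ZeRO-1 style: every rank ends up holding the full averaged gradient."""
    world_size = len(grads_per_rank)
    mean = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    return [list(mean) for _ in range(world_size)]

def reduce_scatter_mean(grads_per_rank):
    """ZeRO-2 style: rank r ends up holding only the r-th 1/N slice of the mean."""
    world_size = len(grads_per_rank)
    mean = [sum(vals) / world_size for vals in zip(*grads_per_rank)]
    shard_size = len(mean) // world_size
    return [mean[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]

# 4 simulated ranks, each with an 8-element locally computed gradient
grads = [[float(r)] * 8 for r in range(4)]
full_copies = all_reduce_mean(grads)  # each rank stores all 8 averaged values
shards = reduce_scatter_mean(grads)   # each rank stores only 8 / 4 = 2 values
```

The reduction result is identical; the only difference is how much of it each rank keeps, which is exactly the gradient memory saving.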
But wait, if every GPU calculates every gradient during backprop, how does this save VRAM? Here's how:
- Parameter gradients are computed layer by layer; as soon as a layer's gradients are ready, they can be reduce-scattered and the local full copy freed (our simplified implementation does not do this).
- During backprop, you only need the activation gradients from the next layer to compute the current layer's parameter gradients; you never need the full gradients of all layers at once.
- That way, you can free gradient memory as you move backward, keeping only the shard assigned to each GPU.
Memory with ZeRO-2 (7B model, 8 GPUs):
- Parameters: 28 GB (fully replicated)
- Gradients: 28 GB / 8 = 3.5 GB
- Optimizer states: 56 GB / 8 = 7 GB
- Total per GPU: 38.5 GB (down from 112 GB)
ZeRO-3: Parameter Partitioning
ZeRO-3 partitions optimizer states, gradients, and model parameters. Each GPU stores only 1/N of the model's entire state.
During the forward and backward passes, each layer needs its full parameters, but each GPU stores only a fraction of them. So we all-gather the parameters just in time, use them, and discard them immediately afterwards.
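The gather-use-discard pattern can be sketched as a plain loop; `gather_full_params` and `free_full_params` are hypothetical placeholders for the all-gather and reshard steps:

```python
def zero3_forward(layers, x, gather_full_params, free_full_params):
    """Conceptual ZeRO-3 forward: materialise one layer's weights at a time."""
    for layer in layers:
        gather_full_params(layer)  # all-gather this layer's full weights
        x = layer(x)               # compute with the full weights
        free_full_params(layer)    # drop them, keeping only the local shard
    return x

# toy demo with plain functions standing in for layers
events = []
layers = [lambda v: v + 1, lambda v: v * 2]
out = zero3_forward(
    layers, 3,
    gather_full_params=lambda l: events.append("gather"),
    free_full_params=lambda l: events.append("free"),
)
# out == 8, and at most one layer's weights are "full" at any time
```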
Training steps:
- Forward pass (per layer):
- All-gather the layer's parameters from all GPUs
- Run the layer's forward pass, using the previous layer's activations as input
- Discard the gathered parameters (keep only the local shard)
- Repeat until all layers are done
- Backward pass (per layer, in reverse order):
- All-gather the layer's parameters again
- Compute the layer's gradients using the activation gradients from the next layer
- Reduce-scatter the gradients (each GPU keeps only its own shard)
- Discard the gathered parameters (keep only the local shard)
- Repeat until all layers are done
- Each GPU runs an optimizer step on its own shard
- No final all-gather is needed, since parameters are gathered layer by layer during the next forward pass

Here is a simplified implementation:
class ZeRO_3(ZeRO_2):
    """
    ZeRO-3: shard optimizer states (stage 1) + gradients (stage 2) + model parameters (stage 3).
    At rest, each rank holds only param_shards[idx] — a 1/world_size slice
    of each parameter. Full parameters are materialised temporarily during
    the forward and backward passes via all_gather, then immediately freed.
    """

    def __init__(self, model, optimizer_cls):
        self.model = model
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        self.param_metadata = []
        shard_list = []
        self._param_to_idx = {}
        for idx, param in enumerate(self.model.parameters()):
            original_shape = param.data.shape
            flat = param.data.view(-1)
            numel = flat.numel()
            remainder = numel % self.world_size
            pad_size = (self.world_size - remainder) % self.world_size
            padded_numel = numel + pad_size
            shard_size = padded_numel // self.world_size
            shard_start = self.rank * shard_size
            shard_end = shard_start + shard_size
            self.param_metadata.append(
                {
                    "original_shape": original_shape,
                    "numel": numel,
                    "padded_numel": padded_numel,
                    "shard_size": shard_size,
                    "shard_start": shard_start,
                    "shard_end": shard_end,
                }
            )
            if pad_size > 0:
                flat_padded = torch.cat([flat, flat.new_zeros(pad_size)])
            else:
                flat_padded = flat
            shard = flat_padded[shard_start:shard_end].clone()
            shard_list.append(shard)
            # Replace the full tensor with only this rank's shard.
            # The model's param.data now points to a tiny slice; the full
            # weight will be reconstructed on demand during forward/backward.
            param.data = shard.detach()
            self._param_to_idx[param] = idx
        self.param_shards = [s.requires_grad_(True) for s in shard_list]
        self.optimizer = optimizer_cls(self.param_shards)
        self._register_hooks()

    def _gather_param(self, idx, device, dtype):
        """All-gather the full parameter tensor for parameter `idx`."""
        meta = self.param_metadata[idx]
        full_flat = torch.empty(meta["padded_numel"], device=device, dtype=dtype)
        dist.all_gather_into_tensor(
            output_tensor=full_flat,
            input_tensor=self.param_shards[idx].data,
        )
        return full_flat[: meta["numel"]].view(meta["original_shape"])

    def _gather_module_params(self, module):
        """Gather full params for every parameter that belongs to this module only (not children)."""
        for param in module.parameters(recurse=False):
            idx = self._param_to_idx[param]
            param.data = self._gather_param(idx, param.device, param.dtype)

    def _reshard_module_params(self, module):
        """Reshard params back to the local shard for every direct param of this module."""
        for param in module.parameters(recurse=False):
            idx = self._param_to_idx[param]
            param.data = self.param_shards[idx].data

    def _register_hooks(self):
        self._hooks = []
        for module in self.model.modules():
            # Skip container modules that have no direct parameters
            if not list(module.parameters(recurse=False)):
                continue
            # Forward: gather -> run -> reshard
            h1 = module.register_forward_pre_hook(
                lambda mod, _inputs: self._gather_module_params(mod)
            )
            h2 = module.register_forward_hook(
                lambda mod, _inputs, _output: self._reshard_module_params(mod)
            )
            # Backward: gather before grad computation -> reshard after
            h3 = module.register_full_backward_pre_hook(
                lambda mod, _grad_output: self._gather_module_params(mod)
            )
            h4 = module.register_full_backward_hook(
                lambda mod, _grad_input, _grad_output: self._reshard_module_params(mod)
            )
            self._hooks.extend([h1, h2, h3, h4])

    def training_step(self, inputs, targets, loss_fn):
        # Hooks handle all gather/reshard around each module automatically
        output = self.model(inputs)
        loss = loss_fn(output, targets)
        loss.backward()
        self._sync_gradients()
        # Each rank updates only its local shard
        self.optimizer.step()
        for param in self.model.parameters():
            param.grad = None
Each layer's parameters are gathered just before they are needed and freed immediately afterwards. This keeps memory overhead small at the cost of extra communication. In practice, real implementations overlap the all-gather of layer N+1 with the forward computation of layer N to hide the latency.
Memory with ZeRO-3 (7B model, 8 GPUs):
- Parameters: 28 GB / 8 = 3.5 GB
- Gradients: 28 GB / 8 = 3.5 GB
- Optimizer states: 56 GB / 8 = 7 GB
- Total per GPU: 14 GB (down from 112 GB)
That is an 8x reduction in memory usage, exactly what we would expect from sharding across 8 GPUs.
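The per-stage totals follow a simple pattern: each ZeRO stage divides one more of the three terms by the number of GPUs. A quick check, reusing the FP32 Adam assumptions from earlier:

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Per-GPU memory in GB (activations excluded) for DDP (stage 0) and ZeRO stages 1-3."""
    gb = 1e9
    params = num_params * 4 / gb      # FP32 weights
    grads = num_params * 4 / gb       # FP32 gradients
    optim = num_params * 2 * 4 / gb   # Adam: two FP32 moments per parameter
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3 also shards parameters
    return params + grads + optim

# 7B model on 8 GPUs:
# stage 0 -> 112.0, stage 1 -> 63.0, stage 2 -> 38.5, stage 3 -> 14.0
```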
Using ZeRO in PyTorch
PyTorch ships with two ZeRO-3 implementations: FSDP1 (older, legacy) and FSDP2 (newer, recommended). Always use FSDP2.
FSDP (Fully Sharded Data Parallel) handles parameter gathering, gradient reduce-scatter, communication overlap, and memory management automatically:
from torch.distributed.fsdp import fully_shard
model = Transformer()
for layer in model.layers:
fully_shard(layer)
fully_shard(model)
You apply fully_shard layer by layer and then wrap the entire model; sharding per layer lets FSDP gather and free one layer's parameters at a time instead of materialising the whole model at once.
Conclusion
ZeRO trades memory for communication, so it's not a free lunch. It is generally not worth it for small models (e.g., BERT), but it is a game changer for large models.
Congratulations on reaching the end! In this post, we covered:
- The memory replication problem in standard DDP
- How ZeRO partitions optimizer states, gradients, and parameters across GPUs
- The three stages of ZeRO and their memory/communication trade-offs
- How to use ZeRO-3 via FSDP in PyTorch
In the next article, we'll explore Tensor Parallelism, a model-parallelism technique that speeds up layer computations by splitting the work across GPUs.
References
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (original paper)
- PyTorch FSDP Tutorial
- FSDP API Reference
- Ultra-Scale Playbook by Hugging Face



