Teaching Neural Networks the Mandelbrot Set

Introduction
The Mandelbrot set is one of the most beautiful mathematical objects ever discovered: a fractal so intricate that no matter how far you zoom in, you keep finding new detail. But what if we asked a neural network to learn it?
At first glance, this sounds like a strange question. The Mandelbrot set is fully deterministic: no data, no noise, no hidden rules. But this very simplicity makes it an ideal sandbox for studying how neural networks represent complex functions.
In this article, we will explore how a simple neural network can learn to approximate the Mandelbrot set, and how Gaussian Fourier features completely change its performance, turning blurry approximations into sharp fractal boundaries.
Along the way, we'll look at why vanilla multi-layer perceptrons (MLPs) struggle with high-frequency patterns (the spectral bias problem) and how Fourier features solve it.
Mandelbrot set
The Mandelbrot set is defined over the complex plane. For each complex number \(c \in \mathbb{C}\), we consider the recursive sequence:
$$z_{n+1} = z_n^2 + c, \qquad z_0 = 0$$
If this sequence remains bounded, then \(c\) belongs to the Mandelbrot set.
In practice, we check this condition with the escape-time algorithm. The escape-time algorithm iterates the sequence up to a finite number of steps and monitors the magnitude \(|z_n|\). If \(|z_n|\) exceeds a chosen escape radius (usually 2), the sequence is guaranteed to diverge, and \(c\) lies outside the Mandelbrot set. If the sequence does not escape within the maximum number of iterations, \(c\) is assumed to belong to the set. The iteration count at which escape occurs is also what renderers typically use to color visualizations of the set.
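As a minimal sketch, the escape-time check described above can be written as:

```python
def escapes(x: float, y: float, max_iter: int = 1000, radius: float = 2.0) -> bool:
    """Return True if c = x + iy escapes within max_iter iterations
    (i.e., c is outside the Mandelbrot set by this test)."""
    c = complex(x, y)
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > radius:
            return True  # |z| exceeded the escape radius: guaranteed divergent
    return False  # never escaped: assumed inside the set
```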
Converting the Mandelbrot Set into a Learning Problem
To train our neural network, we need two things. First, we must define a learning problem: what the model should predict and what the input is. Second, we need labeled data: a large collection of input-output pairs drawn from that problem.
Defining a Learning Problem
At its core, the Mandelbrot set defines a function over the complex plane. Each point \(c = x + iy \in \mathbb{C}\) maps to one of two outcomes: either the sequence generated by the iteration remains bounded, or it diverges. This immediately suggests a binary classification problem, where the input is a complex number and the output indicates whether it belongs to the set.
However, this formulation makes learning difficult. The boundary of the Mandelbrot set is infinitely complex, and an arbitrarily small perturbation of \(c\) can flip the label. From a learning perspective, this yields a highly discontinuous target function, making training unstable and data-inefficient.
To obtain a smooth and informative learning target, we instead reframe the problem using the escape-time information introduced in the previous section. Rather than predicting a binary label, the model is trained to predict a continuous value that reflects how quickly the sequence diverges.
To generate the continuous target, we do not use the raw escape iteration count directly. The iteration count is a discrete value, and using it would introduce discontinuities, especially near the boundary of the Mandelbrot set, where small changes in \(c\) can cause large jumps in the count. Instead, we use a smooth escape-time value, which fractionally interpolates the iteration count to produce a continuous target. In addition, we apply a logarithmic scale that spreads out the small escape values and compresses the large ones, giving a more balanced target distribution.
import math
import numpy as np

def smooth_escape(x: float, y: float, max_iter: int = 1000) -> float:
    c = complex(x, y)
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        r2 = z.real * z.real + z.imag * z.imag
        if r2 > 4.0:
            r = math.sqrt(r2)
            mu = n + 1 - math.log(math.log(r)) / math.log(2.0)  # smooth escape time
            # log-scale to spread small mu values
            v = math.log1p(mu) / math.log1p(max_iter)
            return float(np.clip(v, 0.0, 1.0))
    return 1.0
With this definition, learning the Mandelbrot set becomes a regression problem. A neural network is trained to approximate the function $$f : \mathbb{R}^2 \rightarrow [0,1]$$
mapping spatial coordinates \((x, y)\) in the complex plane to a smooth escape-time value.
Data Sampling Strategy
Uniform sampling of the complex plane would be inefficient: many points lie far from the boundary and carry little information. To address this, the dataset is biased towards boundary regions by oversampling and filtering points whose escape values fall within a pre-defined band.
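The dataset builder below relies on two helpers, `sample_uniform` and `compute_ylim_from_x`, that are not shown in the article. A minimal sketch, assuming the y-range is derived from the x-range and the target aspect ratio, might look like:

```python
import numpy as np

def sample_uniform(n, xlim, ylim, seed=0):
    """Draw n points uniformly from the rectangle xlim x ylim."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(xlim[0], xlim[1], size=n)
    ys = rng.uniform(ylim[0], ylim[1], size=n)
    return np.stack([xs, ys], axis=1).astype(np.float32)

def compute_ylim_from_x(xlim, res, ycenter=0.0):
    """Choose a y-range matching the aspect ratio of res = (width, height)."""
    width, height = res
    y_span = (xlim[1] - xlim[0]) * height / width
    return (ycenter - y_span / 2, ycenter + y_span / 2)
```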
def build_boundary_biased_dataset(
    n_total=800_000,
    frac_boundary=0.7,
    xlim=(-2.4, 1.0),
    res_for_ylim=(3840, 2160),
    ycenter=0.0,
    max_iter=1000,
    band=(0.35, 0.95),
    seed=0,
):
    """
    - Mix of uniform samples + boundary-band samples.
    - 'band' selects points with target in (low, high), which tends to concentrate near the boundary.
    """
    rng = np.random.default_rng(seed)
    ylim = compute_ylim_from_x(xlim, res_for_ylim, ycenter=ycenter)
    n_boundary = int(n_total * frac_boundary)
    n_uniform = n_total - n_boundary
    # Uniform set
    Xu = sample_uniform(n_uniform, xlim, ylim, seed=seed)
    # Boundary pool: oversample, then filter by band
    pool_factor = 20
    pool = sample_uniform(n_boundary * pool_factor, xlim, ylim, seed=seed + 1)
    yp = np.empty((pool.shape[0],), dtype=np.float32)
    for i, (x, y) in enumerate(pool):
        yp[i] = smooth_escape(float(x), float(y), max_iter=max_iter)
    mask = (yp > band[0]) & (yp < band[1])
    Xb = pool[mask]
    yb = yp[mask]
    if len(Xb) < n_boundary:
        # If the band is too strict, keep what we have and top up with uniform samples
        keep = min(len(Xb), n_boundary)
        print(f"[warn] Boundary band too strict; got {len(Xb)} boundary points, using {keep}.")
        Xb = Xb[:keep]
        yb = yb[:keep]
        n_boundary = keep
        n_uniform = n_total - n_boundary
        Xu = sample_uniform(n_uniform, xlim, ylim, seed=seed)
    else:
        Xb = Xb[:n_boundary]
        yb = yb[:n_boundary]
    yu = np.empty((Xu.shape[0],), dtype=np.float32)
    for i, (x, y) in enumerate(Xu):
        yu[i] = smooth_escape(float(x), float(y), max_iter=max_iter)
    X = np.concatenate([Xu, Xb], axis=0).astype(np.float32)
    y = np.concatenate([yu, yb], axis=0).astype(np.float32)
    # Shuffle once
    perm = rng.permutation(X.shape[0])
    return X[perm], y[perm], ylim
Baseline Model: Deep Residual MLP
Our first attempt uses a deep residual MLP that takes the raw Cartesian coordinates \((x, y)\) as input and predicts the smooth escape value.
# Baseline model
class MLPRes(nn.Module):
    def __init__(
        self,
        hidden_dim=256,
        num_blocks=8,
        act="silu",
        dropout=0.0,
        out_dim=1,
    ):
        super().__init__()
        activation = nn.ReLU if act.lower() == "relu" else nn.SiLU
        self.in_proj = nn.Linear(2, hidden_dim)
        self.in_act = activation()
        self.blocks = nn.Sequential(*[
            ResidualBlock(hidden_dim, act=act, dropout=dropout)
            for _ in range(num_blocks)
        ])
        self.out_ln = nn.LayerNorm(hidden_dim)
        self.out_act = activation()
        self.out_proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = self.in_proj(x)
        x = self.in_act(x)
        x = self.blocks(x)
        x = self.out_act(self.out_ln(x))
        return self.out_proj(x)

# Residual block
class ResidualBlock(nn.Module):
    def __init__(self, dim: int, act: str = "silu", dropout: float = 0.0):
        super().__init__()
        activation = nn.ReLU if act.lower() == "relu" else nn.SiLU
        # pre-norm-ish (LayerNorm helps a lot for stability with deep residual MLPs)
        self.ln1 = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim)
        self.ln2 = nn.LayerNorm(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = activation()
        self.drop = nn.Dropout(dropout) if dropout and dropout > 0 else nn.Identity()
        # optional: small init for the last layer to start near-identity
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        h = self.ln1(x)
        h = self.act(self.fc1(h))
        h = self.drop(h)
        h = self.ln2(h)
        h = self.fc2(h)
        return x + h
This network has sufficient capacity: it is deep, residual, and trained on a large dataset with a stable configuration.
The result
The overall shape of the Mandelbrot set is clearly visible. However, fine details near the boundary are blurry. Regions that should display intricate fractal structure appear overly smooth, and thin filaments are poorly defined or missing entirely.
This is not a matter of resolution, data, or depth. So what went wrong?
The Problem of Spectral Bias
Neural networks have a well-known problem called spectral bias:
they tend to learn low-frequency functions first and struggle to represent functions with rapid oscillations or fine details.
The Mandelbrot set is wildly irregular, full of tiny structures, especially near its boundary. To capture it, the network needs to represent high-frequency variation in the output as \(x\) and \(y\) change.
Fourier Features: Encoding Coordinates in Frequency Space
One of the best-known solutions to the spectral bias problem was presented in 2020 by Tancik et al. in their paper Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains.
The idea is to transform the input coordinates before feeding them into the neural network. Instead of passing \((x, y)\) raw, we pass their sinusoidal projections onto random directions in a high-dimensional space.
Formally:
$$\gamma(x) = [\sin(2 \pi Bx), \cos(2 \pi Bx)]$$
where \(B \in \mathbb{R}^{d_{in} \times d_{feat}}\) is a random Gaussian matrix.
This map acts as a random expansion in a Fourier basis, allowing the network to represent high-frequency information much more easily.
class GaussianFourierFeatures(nn.Module):
    def __init__(self, in_dim=2, num_feats=256, sigma=5.0):
        super().__init__()
        B = torch.randn(in_dim, num_feats) * sigma
        self.register_buffer("B", B)

    def forward(self, x):
        proj = (2 * torch.pi) * (x @ self.B)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
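As a quick sanity check, here is the same feature map written in plain NumPy (a sketch, independent of the PyTorch module above): a batch of 2-D coordinates is lifted into a 2 * num_feats dimensional vector of sines and cosines of random projections.

```python
import numpy as np

def gaussian_fourier_features(x, B):
    """x: (n, in_dim) coordinates; B: (in_dim, num_feats) random Gaussian matrix."""
    proj = 2 * np.pi * (x @ B)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
B = rng.normal(size=(2, 256)) * 5.0        # sigma = 5.0 scales the frequencies
coords = rng.uniform(-1, 1, size=(8, 2))   # 8 points in [-1, 1]^2
feats = gaussian_fourier_features(coords, B)
print(feats.shape)                          # (8, 512): 256 sines + 256 cosines
```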
Multi-Scale Gaussian Fourier Features
A single frequency scale may not be sufficient: the Mandelbrot set shows structure at all resolutions (a defining feature of fractal geometry).
To capture this, we use multi-scale Gaussian Fourier features, combining several frequency bands:
class MultiScaleGaussianFourierFeatures(nn.Module):
    def __init__(self, in_dim=2, num_feats=512, sigmas=(2.0, 6.0, 10.0), seed=0):
        super().__init__()
        # split features across scales
        k = len(sigmas)
        per = [num_feats // k] * k
        per[0] += num_feats - sum(per)
        Bs = []
        g = torch.Generator()
        g.manual_seed(seed)
        for s, m in zip(sigmas, per):
            B = torch.randn(in_dim, m, generator=g) * s
            Bs.append(B)
        self.register_buffer("B", torch.cat(Bs, dim=1))

    def forward(self, x):
        proj = (2 * torch.pi) * (x @ self.B)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
This effectively provides the network with a multi-resolution frequency basis, well aligned with the fractal nature of the target.
The Final Model
The final model has the same architecture as the baseline; the only difference is that it uses MultiScaleGaussianFourierFeatures to encode the input.
class MLPFourierRes(nn.Module):
    def __init__(
        self,
        num_feats=256,
        sigma=5.0,
        hidden_dim=256,
        num_blocks=8,
        act="silu",
        dropout=0.0,
        out_dim=1,
    ):
        super().__init__()
        self.ff = MultiScaleGaussianFourierFeatures(
            2,
            num_feats=num_feats,
            sigmas=(2.0, 6.0, sigma),
            seed=0,
        )
        self.in_proj = nn.Linear(2 * num_feats, hidden_dim)
        self.blocks = nn.Sequential(*[
            ResidualBlock(hidden_dim, act=act, dropout=dropout)
            for _ in range(num_blocks)
        ])
        self.out_ln = nn.LayerNorm(hidden_dim)
        activation = nn.ReLU if act.lower() == "relu" else nn.SiLU
        self.out_act = activation()
        self.out_proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = self.ff(x)
        x = self.in_proj(x)
        x = self.blocks(x)
        x = self.out_act(self.out_ln(x))
        return self.out_proj(x)
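To produce images like those shown below, the trained model is evaluated on a dense pixel grid. A minimal, framework-agnostic rendering sketch (the `predict_fn` name is an assumption; with a PyTorch model it could be a wrapper like `lambda a: model(torch.from_numpy(a)).squeeze(-1).detach().numpy()`):

```python
import numpy as np

def render(predict_fn, xlim, ylim, width=960, height=540, batch=65536):
    """Evaluate predict_fn at every pixel center; returns a (height, width) image.
    predict_fn maps an (n, 2) float32 array of coordinates to n predictions."""
    xs = np.linspace(xlim[0], xlim[1], width, dtype=np.float32)
    ys = np.linspace(ylim[0], ylim[1], height, dtype=np.float32)
    gx, gy = np.meshgrid(xs, ys)
    coords = np.stack([gx.ravel(), gy.ravel()], axis=1)
    out = np.empty(coords.shape[0], dtype=np.float32)
    for i in range(0, coords.shape[0], batch):  # chunk to bound memory use
        out[i:i + batch] = predict_fn(coords[i:i + batch])
    return out.reshape(height, width)
```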
Training Dynamics
Training Without Fourier Features
The model gradually learns the overall shape of the Mandelbrot set, but then plateaus. Additional training fails to add more detail.
Training With Fourier Features
Here coarse structure appears first, followed by progressively finer detail. The model keeps refining its prediction rather than plateauing.
Final results
Both models used the same architecture, dataset, and training procedure. The network is a deep residual multilayer perceptron (MLP) trained as a regression model on the smooth escape-time target.
- Dataset: 1,000,000 samples from the complex plane, with 70% of points concentrated near the fractal boundary via biased sampling.
- Architecture: a residual MLP with 20 residual blocks and a hidden size of 512 units.
- Activation: SiLU
- Training: 100 epochs, batch size 4096, Adam-based optimizer, cosine annealing scheduler.
The only difference between the two models is the representation of the input coordinates. The baseline uses raw Cartesian coordinates, while the second model uses multi-scale Gaussian Fourier features at the input.
Global View


Zoom in 1 View


Zoom in 2 View


Conclusions
Fractals such as the Mandelbrot set are an extreme example of functions dominated by high-frequency structure. Approximating them directly from raw coordinates forces neural networks to synthesize increasingly detailed oscillations, a task MLPs are poorly suited for.
What this article shows is that the limitation is not the capacity of the architecture, the amount of data, or the optimization. It is the representation.
By encoding coordinates with multi-scale Gaussian Fourier features, we move a large part of the problem into the input space. The high-frequency structure becomes accessible, allowing the network to approximate a function that would otherwise be out of reach in its original form.
This idea extends beyond fractals. Coordinate-based neural networks are used in computer graphics, physics-informed learning, and signal processing. In all of these settings, the choice of encoding can be the difference between a smooth blur and rich, highly detailed structure.
A note on visual assets
All images, animations, and videos shown in this article were produced by the author from the results of the neural network models described above. No external fractal renderers or third-party visualization assets were used. The full code used for training, rendering, and generating animations is available in the repository associated with this article.
The Complete Code
Github repo



