
Time Series Isn’t Enough: How Graph Neural Networks Change Demand Forecasting

Demand forecasting in supply-chain planning has traditionally been treated as a time-series problem.

  • Each SKU is modeled independently.
  • A rolling time window (say, last 14 days) is used to predict tomorrow’s sales.
  • Seasonality is captured, promotions are added, and forecasts are reconciled downstream.

And yet, despite increasingly sophisticated models, the usual problems persist:

  • Chronic over- and under-stocking
  • Emergency production changes
  • Excess inventory sitting in the wrong place
  • High forecast accuracy on paper, but poor planning outcomes in practice

The issue is that demand in a supply chain is not independent. It is networked. As an example, this is what just 12 SKUs from a typical supply chain look like when you map their shared plants, product groups, subgroups, and storage locations.

So when demand shifts in one corner of the network, the effects ripple through the rest of it.

In this article, we step outside model-first thinking and look at the problem the way a supply chain actually behaves — as a connected operational system. Using a real FMCG dataset, we show why even a simple graph neural network (GNN) fundamentally outperforms traditional approaches, and what that means for both business leaders and data scientists.

A real supply chain experiment

We tested this idea on a real FMCG dataset (SupplyGraph) that combines two views of the business:

Static supply-chain relationships

The dataset has 40 active SKUs, 9 plants, 21 product groups, 36 sub-groups and 13 storage locations. On average, each SKU has ~41 edge connections, implying a densely connected graph where most SKUs are linked to many others through shared plants or product groups.

From a planning standpoint, this network encodes institutional knowledge that often lives only in planners’ heads:

“If this SKU spikes, these others will feel it.”
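To make this concrete, here is a toy sketch (not the dataset's actual loader) of how such a graph can be derived: connect two SKUs whenever they share a plant or a product group. The column names and values below are hypothetical.

```python
# Toy sketch: build SKU-SKU edges from shared attributes.
# Column names and data are hypothetical, for illustration only.
from itertools import combinations
import pandas as pd

skus = pd.DataFrame({
    "sku":           ["A", "B", "C"],
    "plant":         ["P1", "P1", "P2"],
    "product_group": ["G1", "G2", "G2"],
})

edges = set()
for attr in ["plant", "product_group"]:
    for _, group in skus.groupby(attr):
        # Every pair of SKUs sharing this attribute gets an edge.
        for u, v in combinations(group["sku"], 2):
            edges.add((u, v))

print(sorted(edges))  # [('A', 'B'), ('B', 'C')]
```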

Temporal operational signals and sales outcomes

The dataset has temporal data for 221 days. For each SKU and each day, the dataset includes:

  • Sales orders (the demand signal)
  • Deliveries to distributors
  • Factory goods issues
  • Production volumes

Here is an overview of the four temporal signals driving the supply chain model:

Feature                   Total Volume (Units)   Daily Avg   Sparsity (Zero-Activity Days)   Max Single Day
Sales Order               7,753,184              35,082      46.14%                          115,424
Delivery To Distributor   7,653,465              34,631      35.79%                          66,470
Factory Issue             7,655,962              34,642      43.94%                          75,302
Production                7,660,572              34,663      61.96%                          74,082

As can be observed, almost half of the SKU-day combinations have zero sales. The implication is that a small fraction of SKUs drives most of the volume. This is a classic “Intermittent Demand” problem.

Also, manufacturing occurs in infrequent, large batches (lumpy production), while downstream delivery is much smoother and more frequent (low sparsity), implying the supply chain relies on significant inventory buffers.

To stabilize GNN learning and handle extreme skew, all values are transformed using log1p, a standard practice in intermittent demand forecasting.
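As an illustration, the transform and its inverse are one-liners in numpy (the sales values below are made up):

```python
import numpy as np

# Hypothetical daily sales for one SKU: mostly zeros plus a large spike.
sales = np.array([0.0, 0.0, 12.0, 0.0, 115424.0])

x = np.log1p(sales)      # log(1 + x): keeps zeros at 0, compresses spikes
restored = np.expm1(x)   # exact inverse, recovers the original scale

print(np.round(x, 2))    # [ 0.    0.    2.56  0.   11.66]
```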

Key Business Metrics

What does a good demand forecast look like? We evaluate the model on two metrics: WAPE and Bias.

WAPE — Weighted Absolute Percentage Error

WAPE measures how much of your total demand volume is being mis-allocated. Instead of asking “How wrong is the forecast on average across all SKUs?”, WAPE asks the question supply-chain planners actually care about for intermittent demand: “Of all SKU units that were moved through the supply chain to meet demand, what fraction was mis-forecast?”

This matters because errors on high-volume SKUs cost far more than errors on long-tail items. A 10% miss on a top seller is more expensive than a 50% miss on a slow mover. So WAPE weights the SKU-days by volume sold, and aligns more naturally with revenue impact, inventory exposure, plant and logistics utilization (and can be further weighted by price/SKU if required).

That’s why WAPE is widely preferred over MAPE for intermittent, high-skew demand.

\[
\text{WAPE} =
\frac{\sum_{s=1}^{S}\sum_{t=1}^{T} \left| \text{Actual}_{s,t} - \text{Forecast}_{s,t} \right|}
{\sum_{s=1}^{S}\sum_{t=1}^{T} \text{Actual}_{s,t}}
\]

WAPE can be calculated at different levels — product group, region or total business — and over different durations, such as weekly or monthly.

It is important to note that here, WAPE is computed at the hardest possible level — per-SKU, per-day, on intermittent demand — not after aggregating volumes across products or time. In FMCG planning practice, micro-level SKU-daily WAPE of 60–70% is often considered acceptable for intermittent demand, whereas <60% is considered production-grade forecasting.
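A minimal implementation sketch, assuming actuals and forecasts as numpy arrays flattened over SKU-day pairs:

```python
import numpy as np

def wape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """WAPE over all SKU-day pairs: total absolute error / total volume."""
    return float(np.abs(actual - forecast).sum() / actual.sum())

# Toy example: the 20-unit miss on the high-volume SKU dominates the score.
print(wape(np.array([100.0, 0.0, 10.0]), np.array([80.0, 5.0, 10.0])))  # ≈ 0.23
```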

Forecast Bias — Directional Error

Bias measures whether your forecasts systematically push inventory up or down. While WAPE tells you how wrong the forecast is, Bias tells you how operationally expensive it is. It answers a simple but critical question: “Do we consistently over-forecast or under-forecast?” As we will see in the next section, it is possible to have zero bias while being wrong most of the time. Positive bias results in excess inventory, higher holding costs and write-offs, whereas negative bias leads to stock-outs, lost sales and service penalties. In practice, a small positive bias (2-5%) is considered production-safe.

\[ \text{Bias} = \frac{1}{S} \sum_{s=1}^{S} \left( \text{Forecast}_s - \text{Actual}_s \right) \]
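A matching sketch for Bias, assuming per-SKU aggregates as numpy arrays. Reporting it as a percentage of average actual volume (as in the comparison table later) is one common convention, not necessarily the author's exact definition:

```python
import numpy as np

def bias(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Mean signed error per SKU: positive = systematic over-forecasting."""
    return float((forecast - actual).mean())

def bias_pct(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Bias expressed relative to the average actual volume."""
    return float((forecast - actual).mean() / actual.mean())

# Toy check: symmetric misses cancel, so bias is 0 even though WAPE is not.
print(bias(np.array([100.0, 100.0]), np.array([150.0, 50.0])))  # 0.0
```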

Together, WAPE and Bias determine not just whether a model is accurate, but whether its forecasts are operationally and financially usable.

The Baseline: Forecasting Without Structure

To establish a ground floor, we start with a naïve baseline: “tomorrow’s sales equal today’s sales.”

\[ \hat{y}_{t+1} = y_t \]

This approach has:

  • Zero bias
  • No network awareness
  • No understanding of operational context

Despite its simplicity, it is a strong benchmark, especially over the short term. If a model cannot beat this baseline, it is not learning anything meaningful.
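As a sketch, this baseline is a one-day shift per SKU in pandas; the toy data and column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (sku, date).
df = pd.DataFrame({
    "sku":   ["A", "A", "A", "B", "B", "B"],
    "date":  pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "sales": [100.0, 0.0, 50.0, 10.0, 10.0, 0.0],
})

df = df.sort_values(["sku", "date"])
df["naive"] = df.groupby("sku")["sales"].shift(1)  # y_hat(t+1) = y(t)

# Score with WAPE, skipping each SKU's first day (no prior observation).
scored = df.dropna(subset=["naive"])
wape = np.abs(scored["sales"] - scored["naive"]).sum() / scored["sales"].sum()
print(f"WAPE: {wape:.2f}")  # 2.67 on this intermittent toy series
```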

In our experiments, the naïve approach produces a WAPE of 0.86, meaning 86% of the total demand volume is misallocated.

A bias of zero is not a good indicator here, since errors cancel out statistically while creating chaos operationally. For example, over-forecasting one SKU by 100 units and under-forecasting another by 100 units nets out to zero bias, yet both misses trigger costly corrections.

This leads to:

  • Firefighting
  • Emergency production changes
  • Expediting costs

This aligns with what many practitioners experience: Simple forecasts are stable — but wrong where it matters.

Adding the Network: Spatio-Temporal GraphSAGE

We use GraphSAGE, a graph neural network that allows each SKU to aggregate information from its neighbors.

Key characteristics:

  • All relationships are treated uniformly.
  • Information is shared across connected SKUs.
  • Temporal dynamics are captured using a time series encoder.

This model does not yet distinguish between plants, product groups, or storage locations. It simply answers the key question:

“What happens when SKUs stop forecasting in isolation?”

Implementation

While I will dive deeper into the data science behind the feature engineering, training, and evaluation of GraphSAGE in a subsequent article, here are some of the key principles to understand (a code sketch follows the list):

  • The graph with its nodes and edges forms the static spatial features.
  • The spatial encoder component of GraphSAGE, with its convolutional layers, generates spatial embeddings of the graph.
  • The temporal encoder (LSTM) processes the sequence of spatial embeddings, capturing the evolution of the graph over the last 14 days (using a sliding window approach).
  • Finally, a regressor predicts the log1p-transformed sales for the next day.
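To make this pipeline concrete, here is a minimal sketch of such a spatio-temporal model, assuming PyTorch Geometric. It is illustrative rather than the exact implementation used in the experiment; layer sizes, names, and the two-layer spatial encoder are my assumptions.

```python
# Minimal spatio-temporal GraphSAGE sketch (assumes PyTorch Geometric).
# Not the exact experimental code; dimensions and names are illustrative.
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv

class SpatioTemporalSAGE(nn.Module):
    def __init__(self, num_features: int = 4, hidden: int = 64):
        super().__init__()
        # Spatial encoder: GraphSAGE convolutions applied to each day's graph.
        self.conv1 = SAGEConv(num_features, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        # Temporal encoder: LSTM over the sequence of daily embeddings.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Regressor: predicts next-day log1p(sales) per SKU.
        self.head = nn.Linear(hidden, 1)

    def forward(self, x_seq: torch.Tensor, edge_index: torch.Tensor):
        # x_seq: [window=14, num_skus, num_features], one row per SKU per day
        # holding (sales, delivery, factory issue, production).
        daily = []
        for x in x_seq:  # encode the graph day by day
            h = self.conv1(x, edge_index).relu()
            h = self.conv2(h, edge_index).relu()
            daily.append(h)
        seq = torch.stack(daily, dim=1)   # [num_skus, window, hidden]
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])      # [num_skus, 1], log1p scale
```

Training would then slide this 14-day window across the 221 days and minimize a regression loss (e.g., MSE) against the log1p-transformed next-day sales.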

An intuitive analogy

Imagine you’re trying to predict the price of your house next month. The price isn’t just influenced by the history of your own house — like its age, maintenance, or ownership records. It’s also influenced by what’s happening in your neighborhood.

For example:

  • The condition and prices of houses similar to yours (similar construction quality),
  • How well-maintained other houses in your area are,
  • The availability and quality of shared services like schools, parks, or local law enforcement.

In this analogy:

  • Your house’s history is like the temporal features of a particular SKU (e.g., sales, production, delivery history).
  • Your neighborhood represents the graph structure (the edges connecting SKUs with shared attributes, like plants, product groups, etc.).
  • The history of nearby houses is like the neighboring SKUs’ features — it’s how the behavior of other similar houses/SKUs influences yours.

The purpose of training the GraphSAGE model is for it to learn the function f that can be applied to each SKU based on its own historical features (like sales, production, factory issues, etc.) and the historical behavior of its connected SKUs, as determined by the edge relationships (e.g., shared plant, product group, etc.). To depict it more precisely:

embedding_i = f( own_features_i,
                 neighbor_features,
                 relationships )

where those features come from the SKU’s own operational history and the history of its connected neighbors.
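Here is a toy numpy illustration of that idea: one round of GraphSAGE-style mean aggregation, before any learned weights are applied. The SKU names, neighborhoods, and feature values are made up.

```python
import numpy as np

# Feature vector per SKU (e.g., recent sales stats); values are made up.
features = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([3.0, 1.0]),
    "C": np.array([5.0, 2.0]),
}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

def aggregate(sku: str) -> np.ndarray:
    """One GraphSAGE-style step: concatenate a SKU's own features with
    the mean of its neighbors' features (before any learned weights)."""
    neigh_mean = np.mean([features[n] for n in neighbors[sku]], axis=0)
    return np.concatenate([features[sku], neigh_mean])

print(aggregate("B"))  # [3. 1. 3. 1.] -> own (3,1) + mean of A and C (3,1)
```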

The Result: A Structural Step-Change

The impact is quite remarkable:

Model            WAPE
Naïve baseline   0.86
GraphSAGE        ~0.62

In practical terms:

  • The naïve approach misallocates nearly 86% of total demand volume
  • GraphSAGE cuts this to ~62%, a relative error reduction of roughly 28% ((0.86 - 0.62) / 0.86 ≈ 0.28)

The following chart shows actual vs. predicted sales on the log1p scale. The diagonal red line depicts a perfect forecast, where predicted = actual. Most of the high-volume SKUs cluster around the diagonal, which indicates good accuracy.

Actual vs Predicted (log scale)

From a business perspective, this translates into:

  • Fewer emergency production changes
  • Better plant-level stability
  • Less manual firefighting
  • More predictable inventory positioning

Importantly, this improvement comes without any additional business rules — only by allowing information to flow across the network.

And the bias comparison is as follows:

Model       Mean Forecast (Units)   Bias (Units)   Bias %
GraphSAGE   ~733                    +31            ~4.5%
Naïve       ~701                    0              0%

At under 5%, the mild forecasting bias GraphSAGE introduces is well within production-grade limits. The following chart depicts the error in the predictions.

Prediction error

It can be observed that:

  • Error is negligible for most of the forecasts. Recall from the temporal analysis that sparsity in sales is 46%; the model has learned this pattern and correctly predicts “zero” (or very close to it) for those SKU-days, creating the peak at the center.
  • The shape of the bell curve is tall and narrow, which indicates high precision. Most errors are tiny and clustered around zero.
  • There is little skew of the bell curve from the center line, confirming the low bias of 4.5% we calculated.

In practice, many organizations already bias forecasts deliberately to protect service levels, rather than risk stock-outs.

Let’s look at the impact at the SKU level. The following chart shows the forecasts for the top 4 SKUs by volume, denoted by red dotted lines, against the actuals.

Forecast vs Actual — Top 4 SKUs

A few observations:

  • The forecast is reactive in nature. As marked by the green circles in the first chart, the forecast follows the actual on the way up and on the way down, without anticipating the next peak. This is because GraphSAGE treats all relations as homogeneous (equally important), which is not true in reality.
  • The model under-predicts extreme spikes and compresses the upper tail aggressively. GraphSAGE prefers stability and smoothing.

Here is a chart showing the performance across SKUs with non-zero volumes. Two threshold lines are marked at WAPE of 60% and 75%. Three of the four highest-volume SKUs have a WAPE below 60%, with the fourth just above. From a planning perspective, this is a robust and balanced forecast.

Performance across SKUs

Takeaway

Graph neural networks do more than improve forecasts — they change how demand is understood. While not perfect, GraphSAGE demonstrates that structure matters more than model complexity.

Instead of treating each SKU as an independent problem, it allows planners to reason over the supply chain as a connected system.

In manufacturing, that shift — from isolated accuracy to network-aware decision-making — is where forecasting begins to create real economic value.

What’s next? From Connections to Meaning

GraphSAGE showed us something powerful: SKUs do not live in isolation — they live in networks.
But in our current model, every relationship is treated as equal.

In reality, that is not how supply chains work.

A shared plant creates very different dynamics than a shared product group. A shared warehouse matters differently from a shared brand family. Some relationships propagate demand shocks. Others dampen them.

GraphSAGE can see that SKUs are connected — but it cannot learn how or why they are connected.

That is where Heterogeneous Graph Transformers (HGT) come in.

HGT allows the model to learn different behaviors for different types of relationships — letting it weigh, for example, whether plant capacity, product substitution, or logistics constraints should matter more for a given forecast.

In the next article, I will show how moving from “all edges are equal” to relationship-aware learning unlocks the next level of forecasting accuracy — and improves forecast quality by adding meaning to the network.

That is where graph-based demand forecasting becomes truly operational.

Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

Reference

Azmine Toushik Wasi, MD Shafikul Islam, Adipto Raihan Akib. “SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks.”

Images used in this article are synthetically generated. Charts and underlying code created by me.
