Five Questions About Chronos-2, the Time Series Foundation Model

Foundation models have gone mainstream. We first saw them in language, then vision, and now also in video and speech. The recipe is by now familiar: first, pretrain a large neural network on enough data, then apply the model to downstream tasks without any per-task adaptation.

For many industrial applications, time series is a crucial modality. We frequently need to do forecasting, anomaly detection, and classification on many kinds of recorded data. The current practice is usually to build a dedicated model for each specific problem. That can work, but it involves a fair amount of “reinventing the wheel” and may deliver suboptimal performance when the dataset for the problem at hand is small.

Naturally, we’d like to ask: can we apply the same recipe here, that is, pretrain a large time-series foundation model and use it for any downstream task, out of the box?

That’s the bet behind time series foundation models, or TSFMs.

In fact, a lot of work has already gone down this path, and we now see a zoo of such models, to name a few: TimesFM from Google, MOIRAI from Salesforce, Lag-Llama, TimeGPT, and the Chronos family from AWS.

In this post, we look at Chronos-2 [1], the latest model in the Chronos line, released in October 2025. We’ll walk through five questions one might ask when encountering the model for the first time:

  1. What is a time series foundation model, and how does it change the analytics workflow?
  2. Why would a foundation model even work for time series?
  3. What is Chronos-2, specifically?
  4. What new things can we actually do with Chronos-2?
  5. Where does zero-shot stop being enough?

For question 4, we’ll get hands-on with a case study on a synthetic building electricity demand dataset.


1. What is a time series foundation model, and how does it change the analytics workflow?

As its name implies, a TSFM is a single neural network pretrained on a large, diverse collection of time series. Its promise is the same as that of LLMs for text: instead of training a fresh model every time a new forecasting problem comes up, you load one pretrained model and ask it to forecast.

That’s a big shift in the workflow.

Let’s say we’d like to do week-ahead energy demand forecasts for buildings. In the traditional workflow, we’d start by preparing the data, then pick candidate forecasting models (think ARIMA, gradient-boosted trees, LSTM, TCN, or N-BEATS), and then spend most of the project time on training, hyperparameter tuning, and validation. The output is a model that (hopefully) solves this one problem on this one dataset.

Six months later, a new forecasting task arrives, and the cycle restarts almost from scratch.

With a TSFM, most of what I described above is compressed into a single inference call. The workflow becomes: take the historical series (plus related covariates if available, which we will discuss later), pass it to the pretrained TSFM, set the desired forecast horizon, and get back a forecast.

What’s also nice is that you typically get more than a point forecast: the model also returns predictive quantiles that quantify uncertainty.

So what does this imply?

Well, the first thing is that the cost of just trying out a forecast drops a lot. If it works, great. If not, you’ve learned something useful in just ten minutes.

Then, cold-start is no longer a big issue. In the past, you might have had to stop a project simply because “we don’t really have enough data yet.” With a pretrained model, that “little data” might already be sufficient to deliver something meaningful. The model has already seen a lot of demand/traffic/sensor-like patterns. It’s bringing prior knowledge that your tiny dataset can’t fully represent.

Finally, who can do this changes too. It used to take an ML expert to do proper forecasting. A TSFM, of course, doesn’t make any of that knowledge obsolete, but it does mean a domain expert with some Python knowledge can get a credible forecast without years of ML background.

None of this is free, though. You’re now depending on somebody else’s model. Inference gets more expensive. For some domains, zero-shot won’t be good enough. And careful evaluation and validation become even more important.


2. Why would a foundation model even work for time series?

“I don’t believe in TSFM. This shouldn’t really work.”

That’s what I hear most often from my colleagues, and that skepticism makes sense. Language is bounded and has a finite vocabulary. “Apple” means roughly the same thing in a novel or a grocery list.

Numbers are not like that.

Numbers are continuous, and their meaning can vary widely across contexts. A “100” in retail demand would have a very different meaning compared to a “100” in a heart-rate trace.

So why should we hope a pretrained model can work across different contexts?

Well, the model isn’t really learning your specific data; it’s learning shapes such as cycles, trends, level shifts, recurring spikes, and those shapes recur across time series of various domains. The shapes are the “vocabulary” here, and there are far fewer of them than there are possible numeric values. A model that has seen enough of them at enough scales and frequencies can hopefully recognize them in your series, even though it has never seen or been trained on your series before.
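
To make the “shapes, not numbers” idea concrete, here is a minimal sketch with two made-up series that live on very different scales but share the same shape. Once each series is rescaled on its own, they become indistinguishable. The data and the `standardize` helper are illustrative assumptions; the exact normalization used inside Chronos-2 may differ.

```python
from statistics import mean, pstdev


def standardize(series):
    """Rescale a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


# Two made-up series with the same repeating shape but very different units.
retail_demand = [100, 140, 180, 140, 100, 140, 180, 140]  # units sold
heart_rate = [60, 64, 68, 64, 60, 64, 68, 64]             # beats per minute

retail_shape = standardize(retail_demand)
heart_shape = standardize(heart_rate)

# After per-series rescaling, only the shape is left to compare.
max_gap = max(abs(a - b) for a, b in zip(retail_shape, heart_shape))
print(f"largest difference after rescaling: {max_gap:.6f}")
```

The raw values never meet (100 to 180 versus 60 to 68), yet the rescaled series coincide up to floating-point error, which is the sense in which a “100” stops mattering and the pattern takes over.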

Empirically, there are concrete numbers to support this: the Chronos-2 paper reports leading zero-shot accuracy across several benchmarks [1]. In addition, a recent large-scale benchmark in transportation forecasting finds that Chronos-2 beats classical statistical baselines and specialized deep learning architectures, especially at longer horizons, with no task-specific tuning [2].

Of course, that doesn’t mean it always works. Some domains really are unlike anything in the pretraining mix. We’ll come back to that in question 5. But something to keep in mind: TSFM zero-shot is now the baseline to beat, not the other way around.


3. What is Chronos-2, specifically?

In this section, we briefly discuss important aspects of Chronos-2: how it’s used at inference, how it’s built, what it was trained on, and a few practical specs. For more detailed technical discussions, please refer to the original paper [1].

3.1 How is it used at inference?

Practitioners often care the most about how to actually use the model. So let’s start with that.

Chronos-2 offers different usage patterns for performing forecasting. The good news is: you don’t really need to pick different configurations for different forecasting tasks. Instead, you organize the inputs to let the model know what to do.

The key mechanism is the “group ID”.

In the Chronos-2 framework, a “group” is a concept used to represent relatedness. Every time series fed into Chronos-2 should belong to a “group”, identified by an ID. Based on how you assign these IDs at inference time, you can have the following four patterns:

  • Univariate forecasting: This is when you want to forecast one single target. You assign each series its own group ID, and Chronos-2 will simply treat each series independently.
  • Multivariate forecasting: This is when you want to forecast multiple targets at the same time because they might move together or you want their predictions to be mutually informed. To achieve that, you need to give one shared group ID to all related series.
  • Covariate-informed forecasting: This is when you have additional series that influence your target. Those extra series are commonly known as covariates. Some are only known for the past, while others are also known into the future. To perform covariate-informed forecasting, you assign the target and its covariates to the same group ID, with the target identified as the series to forecast and the others provided as known context.
  • Cross-learning: This is when you have many related series and want the forecast for one series to benefit from patterns in the others. This differs from covariate-informed forecasting because every series in the group is itself a target: they’re peers that inform each other, not auxiliary inputs. It also differs semantically from multivariate forecasting. In multivariate forecasting, the series in a group are usually different target dimensions of the same problem (e.g., different load components). In cross-learning, the grouped series are peers (e.g., total loads from different buildings). Nevertheless, the underlying mechanism is the same: you assign all related series the same group ID, so that group attention lets information flow across series.

So, same model, same configurations. The only thing that changes is how the inputs are organized.
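
The four patterns above can be sketched as nothing more than different ways of labeling the same list of series. The `assign_group_ids` helper below is a hypothetical illustration of the concept, not code from the Chronos-2 library. Note that the multivariate and covariate-informed patterns produce the same labels; they differ only in which series of a group are treated as targets.

```python
def assign_group_ids(series, mode):
    """Return one integer group ID per series.

    `series` is a list of (item, variable) pairs,
    e.g. ("Building 01", "total_load_kw").
    """
    if mode == "univariate":
        # Every series is its own group: no information is shared.
        return list(range(len(series)))
    if mode in ("multivariate", "covariate"):
        # All variables that belong to the same item share a group.
        item_to_group = {}
        for item, _ in series:
            item_to_group.setdefault(item, len(item_to_group))
        return [item_to_group[item] for item, _ in series]
    if mode == "cross_learning":
        # Every series is a peer in one big group.
        return [0] * len(series)
    raise ValueError(f"unknown mode: {mode}")


series = [
    ("Building 01", "total_load_kw"),
    ("Building 01", "hvac_load_kw"),
    ("Building 02", "total_load_kw"),
    ("Building 02", "hvac_load_kw"),
]

for mode in ("univariate", "multivariate", "cross_learning"):
    print(f"{mode:>15}: {assign_group_ids(series, mode)}")
```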

Figure 1. How group IDs define Chronos-2 forecasting modes: univariate, multivariate, covariate-informed, and cross-learning. (Image by author)

3.2 How was it built?

Chronos-2 is an encoder-only Transformer with 120M parameters, which is quite small by today’s LLM standards. A few design choices are worth highlighting here:

1. Chronos-2 uses continuous patch embeddings, not discrete vocabulary tokens.

If you know the original Chronos, you might remember that it encodes a time series by first scaling its values and then quantizing each value into one of a fixed set of bins. Those bins are then treated as “tokens.”

Chronos-2, however, drops this approach.

Instead, it groups consecutive observations into a “patch” and embeds the whole patch directly as a continuous vector. If you are familiar with Vision Transformers (ViT), you’ll immediately see the similarity. As a result, the model processes far fewer items per series, which leads to faster inference, and it avoids the precision loss caused by quantization.
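
As a rough illustration of the patching step, the sketch below splits a series into non-overlapping patches. The patch length of 16 and the left-padding with `None` are assumptions made for this example; the learned embedding that turns each patch into a vector is not shown.

```python
def to_patches(values, patch_length):
    """Split a series into non-overlapping patches, left-padding with None if needed."""
    remainder = len(values) % patch_length
    pad = (patch_length - remainder) % patch_length
    padded = [None] * pad + list(values)
    return [padded[i:i + patch_length] for i in range(0, len(padded), patch_length)]


# 45 days of hourly readings (placeholder values, only the length matters here).
hourly_context = list(range(1080))
patches = to_patches(hourly_context, patch_length=16)  # 16 is a hypothetical patch length

print(len(hourly_context), "observations ->", len(patches), "patches")
```

With these assumed settings, the encoder would see a few dozen patches instead of more than a thousand individual values, which is where the speed-up comes from.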

2. Inside the encoder, Chronos-2 has two kinds of attention, and it alternates them in each layer.

Chronos-2 has time attention and group attention mechanisms.

Time attention, as its name implies, lets each series attend to its own past, thus capturing the temporal structure.

Group attention works across series: at each time position, all series sharing a group ID can attend to one another.

Concretely, if you think of the input as a matrix, where the rows are the time series and columns are the time, then time attention happens within a single row, while group attention happens within a single column.

The two attention mechanisms alternate layer by layer. As a result, information eventually flows both temporally within each series and across the group.
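
The row/column picture can be written down directly. The sketch below lists which cells of the (series × time) grid a given cell can see under each attention type. It only illustrates the connectivity pattern under the assumptions of this post; the function names are ours, and no actual attention weights are computed.

```python
def time_attention_positions(row, n_steps):
    """Cells visible under time attention: same series, every time position."""
    return [(row, col) for col in range(n_steps)]


def group_attention_positions(row, col, group_ids):
    """Cells visible under group attention: same time position, same group ID."""
    return [(r, col) for r, gid in enumerate(group_ids) if gid == group_ids[row]]


# Three series: the first two share a group, the third is on its own.
group_ids = [0, 0, 1]
n_steps = 4

print(time_attention_positions(row=0, n_steps=n_steps))
print(group_attention_positions(row=0, col=2, group_ids=group_ids))
print(group_attention_positions(row=2, col=2, group_ids=group_ids))
```

A cell in series 0 sees its whole row under time attention and the same column of series 1 under group attention, while series 2 (alone in its group) only ever sees itself across series. Stacking the two alternately is what lets information reach any cell of a grouped series.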

3. For the output, Chronos-2 employs a direct quantile regression head instead of autoregressive token generation.

In Chronos-2, the prediction is not done in an autoregressive fashion. Instead, it is a regression head that outputs all 21 quantile level predictions for all steps in the forecast horizon in one shot.
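
Quantile regression heads are typically trained with the pinball (quantile) loss, which penalizes over- and under-prediction asymmetrically depending on the quantile level. The sketch below shows how that loss is computed for a handful of made-up numbers; the helper names are ours, the quantile levels are the three requested later in this post rather than the model’s full set of 21, and the precise training objective is described in the paper [1].

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss of one prediction for quantile level q (0 < q < 1)."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)


def mean_pinball_loss(y_true, quantile_forecasts):
    """Average pinball loss over all steps and all quantile levels.

    `quantile_forecasts` maps a quantile level to a list of predictions
    with the same length as `y_true`.
    """
    total, count = 0.0, 0
    for q, preds in quantile_forecasts.items():
        for y, p in zip(y_true, preds):
            total += pinball_loss(y, p, q)
            count += 1
    return total / count


# Made-up actuals and forecasts for a 3-step horizon.
actuals = [100.0, 120.0, 90.0]
forecasts = {
    0.025: [85.0, 100.0, 75.0],
    0.5: [98.0, 125.0, 92.0],
    0.975: [115.0, 140.0, 105.0],
}
print(f"mean pinball loss: {mean_pinball_loss(actuals, forecasts):.3f}")
```

For a high quantile such as 0.9, predicting too low costs nine times more than predicting too high by the same amount, which is what pushes that output toward the upper end of the distribution.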

Put together, these choices make Chronos-2 both fast and probabilistic by default.

3.3 What was it trained on?

For a time-series foundation model, training data is a crucial ingredient.

Because real-world corpora are usually scarce, Chronos-2 relies heavily on synthetic data, which comes in two tracks:

For univariate forecasting, synthetic series are generated from three generators: Gaussian-process curves (KernelSynth, inherited from Chronos-1), random combinations of trend/seasonality/irregularity (TSI), and series sampled from random temporal causal graphs (TCM).

For multivariate (and covariate-informed) settings, all the training data is synthetic. The Chronos-2 team developed a tool called “multivariatizers”, which takes several univariate series from the generators above and imposes dependencies between them, such as same-time correlations or time-shifted ones (lead-lag, cointegration).
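
To get a feel for what “imposing a dependency” means, here is a toy lead-lag construction: a follower series that echoes a driver series a few steps later. This is only an illustration of the idea under our own assumptions (the function names, the sine-wave driver, and the noise level are all made up); it is not the generator used in the paper.

```python
import math
import random


def lead_lag_follower(driver, lag, weight, noise_scale, seed=0):
    """Return a series that echoes `driver` `lag` steps later, scaled, plus Gaussian noise."""
    rng = random.Random(seed)
    follower = []
    for t in range(len(driver)):
        # Hold the first driver value until enough history exists for the lag.
        source = driver[t - lag] if t >= lag else driver[0]
        follower.append(weight * source + rng.gauss(0.0, noise_scale))
    return follower


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# A daily sine wave sampled hourly for 20 days stands in for a generated univariate series.
driver = [math.sin(2 * math.pi * t / 24) for t in range(480)]
follower = lead_lag_follower(driver, lag=6, weight=0.8, noise_scale=0.02)

# Scan candidate shifts and report the one where the two series line up best.
scores = {
    shift: correlation(driver[: len(driver) - shift], follower[shift:])
    for shift in range(13)
}
best_shift = max(scores, key=scores.get)
print("best shift:", best_shift)
```

A model trained on many such pairs has a reason to look across series within a group, because the driver’s recent past carries information about the follower’s near future.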

In fact, the most striking finding from the paper is that synthetic data alone is almost enough: a variant trained only on synthetic data performed only slightly worse than the final model.

3.4 Practical specs

Finally, a couple of practical specs worth knowing if you want to use the model:

  • Its maximum context length is 8192 steps.
  • Its maximum forecast horizon is 1024 steps (well over a year of daily data or six weeks of hourly).
  • Its license is Apache-2.0.
  • Both CPU and GPU inference are supported.
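
A quick back-of-the-envelope check helps translate these step limits into calendar time for a given sampling rate. The helper names below are ours, and the constants simply restate the limits listed above.

```python
MAX_CONTEXT_STEPS = 8192   # maximum context length stated above
MAX_HORIZON_STEPS = 1024   # maximum forecast horizon stated above


def steps_to_days(steps, steps_per_day):
    """Convert a number of time steps into days for a given sampling rate."""
    return steps / steps_per_day


def fits_limits(context_steps, horizon_steps):
    """True when both the context and the horizon are within the stated limits."""
    return context_steps <= MAX_CONTEXT_STEPS and horizon_steps <= MAX_HORIZON_STEPS


# Hourly data: how much calendar time do the limits cover?
print(f"context: {steps_to_days(MAX_CONTEXT_STEPS, 24):.1f} days")
print(f"horizon: {steps_to_days(MAX_HORIZON_STEPS, 24):.1f} days")

# The case study in this post: 45 days of hourly context, one week ahead.
print(fits_limits(context_steps=45 * 24, horizon_steps=7 * 24))
```

For hourly data, the context limit corresponds to roughly 341 days and the horizon limit to roughly 43 days, so the 45-day context and one-week horizon used later sit comfortably inside both.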

That’s Chronos-2 in concept.


4. What new things can we actually do with Chronos-2?

In this section, let’s get hands-on and do some actual forecasting with Chronos-2.

Here, we consider a small case study with a synthetic building electricity-demand forecasting problem. Specifically, we want to do hourly electricity demand forecasts one week ahead. For most buildings, we have 45 days of recent recording data. For one newly onboarded building, only a few days are available. This lets us test a cold-start setting later.

For this case study, we use physically simulated data. The main target is the total demand, which is the sum of base load, plug load, lighting load, and HVAC load. Physically, the plug and lighting loads follow weekday occupancy patterns, and the HVAC load responds to outdoor temperature and the building’s thermal dynamics. For detailed data simulation and generation, please refer to the notebook I attached at the end of this post.

Figure 2. Total load time series for all the buildings. All buildings have 45 days of history, except building 06, which has only 3 days of recording. (Image by author)

4.1 Setting up the Chronos-2 model

On the tooling side, we’ll use the Chronos-2 model through the chronos-forecasting Python package. We’ll also need PyTorch, pandas, and the usual scientific Python stack:

pip install chronos-forecasting pandas numpy matplotlib

The model weights themselves are hosted on Hugging Face under amazon/chronos-2. You can instantiate the pipeline with:

from chronos import Chronos2Pipeline
pipeline = Chronos2Pipeline.from_pretrained("amazon/chronos-2", device_map="cuda")  # or device_map="cpu"

The first from_pretrained call downloads the weights into your local Hugging Face cache (~/.cache/huggingface/); subsequent calls load them from disk. The model takes about 478 MB on disk.

Chronos-2 can run on CPU, but GPU inference is usually preferred.

I am using my personal laptop with an NVIDIA RTX 2000 Ada GPU (8GB VRAM). For this notebook’s workload (hourly data, 45-day context, 168-hour horizon, 8 buildings), the univariate forecast (8 series) completes in ~0.07s, the multivariate forecast (8 buildings × 4 targets = 32 series) takes ~0.22s, and the covariate-informed forecast takes ~0.27s. Peak GPU memory stays under 1GB throughout. A production workload at a very different scale (more series, longer context, or a longer forecast horizon) will likely show very different throughput.

4.2 The univariate forecasting

Can Chronos-2 forecast building demand zero-shot?

The first thing we would like to try is the simplest setup possible: we hand Chronos-2 each building’s recent demand history and ask for a week-ahead forecast. That’s it, nothing fancy.

We first generate the synthetic data:

full_df = make_dataset()

This is how the data looks:

          building           timestamp  ...  solar_irradiance  is_weekend
0      Building 01 2025-03-01 00:00:00  ...               0.0           1
1      Building 01 2025-03-01 01:00:00  ...               0.0           1
2      Building 01 2025-03-01 02:00:00  ...               0.0           1
3      Building 01 2025-03-01 03:00:00  ...               0.0           1
4      Building 01 2025-03-01 04:00:00  ...               0.0           1
...            ...                 ...  ...               ...         ...
32635  Building 08 2025-08-17 19:00:00  ...               0.0           1
32636  Building 08 2025-08-17 20:00:00  ...               0.0           1
32637  Building 08 2025-08-17 21:00:00  ...               0.0           1
32638  Building 08 2025-08-17 22:00:00  ...               0.0           1
32639  Building 08 2025-08-17 23:00:00  ...               0.0           1

[32640 rows x 11 columns]

This dataset contains the following columns:

['building', 'timestamp', 'total_load_kw', 'hvac_load_kw', 'plug_load_kw',
 'lighting_load_kw', 'indoor_temp_c', 'outdoor_temp_c', 'occupancy',
 'solar_irradiance', 'is_weekend']

The building column is the group ID column; total_load_kw is the target column that we aim to forecast.

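Then, we can prepare the historical context and make the forecast with the predict_df API:

import pandas as pd

# Dates inferred from the outputs shown in this post (the notebook defines them).
cutoff_date = pd.Timestamp("2025-07-14")                   # first forecast timestamp
context_start_date = cutoff_date - pd.Timedelta(days=45)   # 45-day context window

# Keep only the rows inside the context window.
history_df = full_df[
    (full_df["timestamp"] >= context_start_date)
    & (full_df["timestamp"] < cutoff_date)
].copy()

context_univariate = history_df[["building", "timestamp", "total_load_kw"]]

pred_univariate = pipeline.predict_df(
    context_univariate,
    prediction_length=168,            # one week of hourly forecasts
    quantile_levels=[0.025, 0.5, 0.975],   # 95% prediction interval
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
)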

Here is what pred_univariate looks like:

         building           timestamp  ...         0.5       0.975
0     Building 01 2025-07-14 00:00:00  ...  175.027161  194.386108
1     Building 01 2025-07-14 01:00:00  ...  177.673050  198.921997
2     Building 01 2025-07-14 02:00:00  ...  175.633270  199.574677
3     Building 01 2025-07-14 03:00:00  ...  167.960052  192.789505
4     Building 01 2025-07-14 04:00:00  ...  154.674759  178.479599
...           ...                 ...  ...         ...         ...
1339  Building 08 2025-07-20 19:00:00  ...  103.391228  164.987076
1340  Building 08 2025-07-20 20:00:00  ...  116.543739  185.808151
1341  Building 08 2025-07-20 21:00:00  ...  135.177704  202.919937
1342  Building 08 2025-07-20 22:00:00  ...  150.139679  216.866089
1343  Building 08 2025-07-20 23:00:00  ...  160.572784  219.451172

[1344 rows x 7 columns]

with the following columns:

['building', 'timestamp', 'target_name', 'predictions', '0.025', '0.5', '0.975']

The predictions column holds the point forecast, which here coincides with the median (the 0.5 column), while the 0.025 and 0.975 columns hold the lower and upper quantiles requested for each (building, hour).

For visualizing the results, we pick building 03 as an example:

Figure 3. Univariate forecasting for building 03. (Image by author)

We can see that, without any fine-tuning, the forecast captures both the daily occupancy cycle and the weekday/weekend rhythm solely from the 45-day context window. The figure also shows the 95% prediction interval (the band between the 0.025 and 0.975 quantiles), which mostly covers the ground truth.
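
“Mostly covers” can be turned into a number by computing the empirical coverage, that is, the fraction of actual values that fall inside the interval. The sketch below uses made-up values and a hypothetical helper name; in the notebook, the inputs would be the ground-truth series and the 0.025 and 0.975 forecast columns.

```python
def interval_coverage(actuals, lower, upper):
    """Fraction of actual values that fall inside [lower, upper] at each step."""
    inside = sum(1 for y, lo, hi in zip(actuals, lower, upper) if lo <= y <= hi)
    return inside / len(actuals)


# Made-up values for a 5-step horizon.
actuals = [170.0, 182.0, 165.0, 150.0, 210.0]
lower = [156.0, 161.0, 158.0, 140.0, 150.0]
upper = [184.0, 191.0, 192.0, 178.0, 200.0]

print(f"empirical coverage: {interval_coverage(actuals, lower, upper):.0%}")
```

For a well-calibrated 95% interval, this fraction should land near 0.95 over a sufficiently long evaluation window; values far below that suggest the intervals are too narrow.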

Note that what we just did is effectively batch forecasting for all eight buildings. Across those buildings, zero-shot Chronos-2 produces a weighted absolute percentage error (WAPE) of 8.6%. This is unlikely to be the best achievable performance on this specific dataset, but it is a credible result for very little effort.
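
For reference, WAPE is the total absolute error divided by the total absolute actual value. A minimal sketch with made-up numbers follows; how the notebook pools buildings (all hours pooled together versus averaged per building) is an assumption we leave open here.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: sum of |error| over sum of |actual|."""
    total_abs_error = sum(abs(y - f) for y, f in zip(actuals, forecasts))
    total_abs_actual = sum(abs(y) for y in actuals)
    return total_abs_error / total_abs_actual


# Made-up hourly demand (kW) and point forecasts.
actuals = [100.0, 200.0, 300.0]
forecasts = [110.0, 190.0, 330.0]

print(f"WAPE: {wape(actuals, forecasts):.1%}")
```

Because the errors are weighted by the size of the actual values, hours with high demand dominate the score, which is usually what matters for load forecasting.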

4.3 Multivariate forecasting

Can Chronos-2 forecast multiple targets simultaneously?

Next, we use Chronos-2 to forecast the individual components of demand, i.e., HVAC, plug, and lighting. In our current setup, those components share underlying driving factors, and they’re correlated. A model that treats them as a system can hopefully leverage those correlations to deliver more accurate predictions.

The code is almost the same as the univariate version — only the target argument changes from a string to a list. Also, all four targets per building share one group ID, so the model can attend across them at each time position.

target_columns = ["total_load_kw", "hvac_load_kw", "plug_load_kw", "lighting_load_kw"]
context_multivariate = history_df[["building", "timestamp"] + target_columns]

pred_multivariate = pipeline.predict_df(
    context_multivariate,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target=target_columns,           # now a list
)

This is what the resulting pred_multivariate looks like:

         building           timestamp  ...         0.5       0.975
0     Building 01 2025-07-14 00:00:00  ...  170.219849  185.517609
1     Building 01 2025-07-14 01:00:00  ...  175.033524  191.951599
2     Building 01 2025-07-14 02:00:00  ...  175.106644  193.513306
3     Building 01 2025-07-14 03:00:00  ...  169.450287  189.806625
4     Building 01 2025-07-14 04:00:00  ...  159.008575  177.918198
...           ...                 ...  ...         ...         ...
5371  Building 08 2025-07-20 19:00:00  ...   19.697739   24.007296
5372  Building 08 2025-07-20 20:00:00  ...   19.775898   23.811647
5373  Building 08 2025-07-20 21:00:00  ...   19.995640   24.352007
5374  Building 08 2025-07-20 22:00:00  ...   19.610260   23.614372
5375  Building 08 2025-07-20 23:00:00  ...   19.025314   22.950117

[5376 rows x 7 columns]

with the following columns:

['building', 'timestamp', 'target_name', 'predictions', '0.025', '0.5', '0.975']

The important difference is target_name:

['total_load_kw', 'hvac_load_kw', 'plug_load_kw', 'lighting_load_kw']

We have 1344 rows for each of the four targets, which is why pred_multivariate has 5376 rows in total.

Next, we check building 03 results again. In the figure below, each panel shows the forecast for one component:

Figure 4. Forecasts for multiple targets. (Image by author)

In our synthetic case, the plug and lighting loads are driven mostly by routine that follows the weekday occupancy schedule, and the model picks them up easily from the 45-day context. HVAC load is more variable because it is driven by outdoor temperature dynamics that the model has to infer from the demand pattern alone (keep in mind that it doesn’t see temperature explicitly yet, but we will fix that later). As a result, we see some clear discrepancies in HVAC load predictions.

Component-wise, we have 15.4% WAPE for HVAC load, 4.6% for lighting load, 2.2% for plug load, and 5.4% for the total load.

Interestingly, the total-load WAPE in the multivariate setup (5.4%) is also lower than the univariate baseline of 8.6% from the previous section. A plausible explanation is that the model leverages correlation patterns between the components to better infer what the total load will look like.

From a practical perspective, it is also nice to have a single predict_df call to return forecasts for the whole load breakdown with consistent treatment. This can be very useful for many downstream operations, as now the operator knows not just how much demand to expect but also where it’s coming from. This can inform designing effective HVAC scheduling, lighting controls, and peak-shaving strategies.

4.4 Covariate-informed forecasting

Can Chronos-2 use known future weather and operating schedules?

Many real-world forecasting problems come with information about the future that we already know. For our building demand problem, we know the future weather and operating schedule. Therefore, we should hand them to the Chronos-2 model and ask it to better inform its predictions.

This is Chronos-2’s covariate-informed mode from section 3.1. Here, the target and covariates share the same group ID, but only the target gets predicted. The covariates need to be supplied for both the historical window, so the model can learn their relationship to demand, and the forecast horizon, so it can condition on their known future values.

Figure 5. Known covariates include outdoor temperature, occupancy schedule, and solar irradiance. (Image by author)

The figure above shows known-future signals (i.e., outdoor temperature, occupancy schedule, and solar irradiance) we’ll condition on for Building 03. Additionally, is_weekend is also included as a categorical covariate.

In a real deployment, these might come from a weather service and a building management system. Here, we produce them using the same simulator that generated the demand history.

The code requires two changes from the univariate case: the historical context now includes the covariate columns (alongside the target), and a future_df argument holds the covariate values for the forecast horizon.

future_truth_df = full_df[
    (full_df["timestamp"] >= cutoff_date)
    & (full_df["timestamp"] < cutoff_date + pd.Timedelta(hours=168))
].copy()
future_covariates_df = future_truth_df[
    ["building", "timestamp", "outdoor_temp_c", "occupancy", "solar_irradiance", "is_weekend"]
].copy()

known_future_columns = ["outdoor_temp_c", "occupancy", "solar_irradiance", "is_weekend"]
context_with_covariates = history_df[["building", "timestamp", "total_load_kw"] + known_future_columns]

One concrete row from context_with_covariates:

   building  timestamp  total_load_kw  outdoor_temp_c  occupancy  solar_irradiance  is_weekend
Building 01 2025-05-30     190.671989       30.232122   0.000553               0.0           0

One concrete row from future_covariates_df:

   building  timestamp  outdoor_temp_c  occupancy  solar_irradiance  is_weekend
Building 01 2025-07-14       30.669441   0.000553               0.0           0

Notice that we supply both dataframes to the API:

pred_with_covariates = pipeline.predict_df(
    context_with_covariates,
    future_df=future_covariates_df,     # covariate values for the forecast horizon
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",              # still a single string — one target
)
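
Before making a call like the one above, it can be worth checking that the future covariates line up with the forecast horizon: one row per step, starting right after the last context timestamp. The helpers below are a hypothetical sanity check written for this post, assuming hourly data; they are not part of the chronos-forecasting API.

```python
from datetime import datetime, timedelta


def expected_future_timestamps(last_context_timestamp, prediction_length,
                               step=timedelta(hours=1)):
    """Timestamps the forecast horizon should cover, one per step."""
    return [last_context_timestamp + step * (i + 1) for i in range(prediction_length)]


def future_rows_are_aligned(future_timestamps, last_context_timestamp,
                            prediction_length, step=timedelta(hours=1)):
    """True when the future timestamps match the expected horizon exactly."""
    expected = expected_future_timestamps(last_context_timestamp, prediction_length, step)
    return list(future_timestamps) == expected


# Last hour of history in this case study, followed by a one-week horizon.
last_context = datetime(2025, 7, 13, 23)
horizon = expected_future_timestamps(last_context, prediction_length=168)

print(horizon[0], "->", horizon[-1])
print(future_rows_are_aligned(horizon, last_context, 168))
```

In the notebook, one would run such a check per building on the timestamp column of future_covariates_df.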

First few rows of pred_with_covariates:

   building           timestamp   target_name  predictions      0.025        0.5      0.975
Building 01 2025-07-14 00:00:00 total_load_kw   169.025482 156.235291 169.025482 183.735748
Building 01 2025-07-14 01:00:00 total_load_kw   174.691132 161.001785 174.691132 190.913406
Building 01 2025-07-14 02:00:00 total_load_kw   174.087845 158.777023 174.087845 191.526764
Building 01 2025-07-14 03:00:00 total_load_kw   170.675781 155.457169 170.675781 188.504807
Building 01 2025-07-14 04:00:00 total_load_kw   160.536011 145.712753 160.536011 177.670593
Building 01 2025-07-14 05:00:00 total_load_kw   154.637436 140.687515 154.637436 170.052383
Building 01 2025-07-14 06:00:00 total_load_kw   155.766968 141.554367 155.766968 170.922150
Building 01 2025-07-14 07:00:00 total_load_kw   167.152252 152.965271 167.152252 182.419205

The figure below shows the new prediction results for Building 03:

Figure 6. Building 03 forecasting: with vs without covariates. (Image by author)

We can see a clear improvement in forecasting accuracy when Chronos-2 has access to informative covariates. Across all eight buildings, the WAPE drops to 4%, a big improvement over the 8.6% obtained without covariates.

The practical takeaway is this: when reliable information about the future is available, hand it to the model.

4.5 Cross-learning

Can related buildings help a newly metered building?

The final scenario we are investigating here is the one that a TSFM is best positioned to handle in principle: a cold start.

Imagine Building 06 has just been connected to the monitoring platform, so we have only three days of meter history for it. Three days are usually not enough to fit a reasonable building-specific model the traditional way. But now that we have a TSFM, a natural question is: can the other seven buildings, each with 45 days of history, help forecast Building 06?

This is Chronos-2’s cross-learning mode that we have discussed in section 3.1. Implementation-wise, all eight buildings should share one group ID. The model doesn’t need to be told that the buildings are related; it naturally picks up usable patterns through group attention across the series. Also, in this study, we deliberately drop future covariates, so no future weather or schedule is being passed. This way, we’d know that any improvement has to come from peer histories alone.

We build two dataframes:

short_building = "Building 06"
short_history_start = cutoff_date - pd.Timedelta(days=3)
context_univariate = history_df[["building", "timestamp", "total_load_kw"]].copy()

cold_context = pd.concat(
    [
        context_univariate[context_univariate["building"] != short_building],
        context_univariate[
            (context_univariate["building"] == short_building)
            & (context_univariate["timestamp"] >= short_history_start)
        ],
    ],
    ignore_index=True,
).sort_values(["building", "timestamp"])

cold_context_new_only = cold_context[cold_context["building"].eq(short_building)].copy()

Here are the first few rows of cold_context:

   building           timestamp  total_load_kw
Building 01 2025-05-30 00:00:00     190.671989
Building 01 2025-05-30 01:00:00     181.611690
Building 01 2025-05-30 02:00:00     177.875806
Building 01 2025-05-30 03:00:00     166.297421
Building 01 2025-05-30 04:00:00     154.846159
Building 01 2025-05-30 05:00:00     151.078626
Building 01 2025-05-30 06:00:00     157.557114
Building 01 2025-05-30 07:00:00     155.563899

With the following counts:

building
Building 01    1080
Building 02    1080
Building 03    1080
Building 04    1080
Building 05    1080
Building 06      72
Building 07    1080
Building 08    1080

Here are the first few rows of cold_context_new_only:

   building           timestamp  total_load_kw
Building 06 2025-07-11 00:00:00     101.528455
Building 06 2025-07-11 01:00:00     117.270784
Building 06 2025-07-11 02:00:00     111.178600
Building 06 2025-07-11 03:00:00     110.586007
Building 06 2025-07-11 04:00:00      98.715046
Building 06 2025-07-11 05:00:00     100.550960
Building 06 2025-07-11 06:00:00     114.863499
Building 06 2025-07-11 07:00:00     125.766400

We run the following A/B comparison:

# Isolated: only Building 06's 3-day history
pred_isolated = pipeline.predict_df(
    cold_context_new_only,
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
    cross_learning=False,
)

# Cross-learning: Building 06's 3-day history + 7 siblings' 45-day histories
pred_cross = pipeline.predict_df(
    cold_context,                   # includes all 8 buildings
    prediction_length=168,
    quantile_levels=[0.025, 0.5, 0.975],
    id_column="building",
    timestamp_column="timestamp",
    target="total_load_kw",
    cross_learning=True,            # attend across the group
)

The results are shown below:

Figure 7. Cross-learning results for Building 06. (Image by author)

The lower panel shows the two forecasts together with the ground truth. The “isolated forecast” is the result when Chronos-2 uses only the three days of data as context. It manages to capture the daily cycle, but it misses the weekly rhythm and underestimates the peaks. The cross-learning version, on the other hand, picks up the weekday/weekend shape and the peak magnitude from the other buildings, thus yielding better demand predictions. WAPE drops from 22.2% for the isolated forecast to 16.7% with cross-learning.
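
To put the WAPE figures reported in this post on a common footing, we can express each one as a relative error reduction against its baseline. The helper name is ours; the numbers are the ones quoted in the text, so the results inherit their rounding.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# (baseline WAPE %, improved WAPE %) as reported in this post.
reported = {
    "multivariate vs univariate (total load)": (8.6, 5.4),
    "with covariates vs without": (8.6, 4.0),
    "cross-learning vs isolated (Building 06)": (22.2, 16.7),
}

for label, (baseline, improved) in reported.items():
    print(f"{label}: {relative_reduction(baseline, improved):.1%} lower WAPE")
```

Seen this way, known future covariates give the largest gain in this case study (a bit over half of the error removed), while cross-learning removes roughly a quarter of the cold-start error.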

Note that cross-learning here does not mean the model peeked at the other buildings’ futures: only histories are in the model’s context. What the model is doing is in-context learning: it sees the other seven buildings in the portfolio, relates their patterns to what Building 06’s three days show, and projects forward accordingly.


5. Where does zero-shot stop being enough?

Before getting too excited about the Chronos-2 capabilities we just saw in the case study, we should keep one thing in mind: zero-shot is a great default, but it isn’t a universal answer.

So, where does zero-shot stop being enough? I believe the following four signals are important to watch for:

  1. Your data looks unlike anything in the pretraining mix. For example, specialized scientific signals, or niche sensor types.
  2. You have lots of clean history that isn’t being used. Chronos-2 pays no attention to anything outside the context window. If that history exists and contains patterns Chronos-2 hasn’t seen, you probably need fine-tuning to explicitly encode it.
  3. You see systematic errors that the model keeps making. For issues like that, no amount of context engineering is likely to fix them. You need targeted adaptation to bridge the gap.
  4. You need behavior the zero-shot objective doesn’t optimize for. If your downstream cost is asymmetric, e.g., under-forecasting demand costs you ten times what over-forecasting does, fine-tuning with your specific loss function might be the way to go.
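
To make the asymmetric-cost case in the last item concrete, here is a minimal sketch of a cost function where under-forecasting is ten times as expensive as over-forecasting. The function name and the weights are illustrative assumptions, not a loss taken from Chronos-2.

```python
def asymmetric_cost(actual, forecast, under_weight=10.0, over_weight=1.0):
    """Cost of one forecast when under- and over-forecasting are priced differently."""
    error = actual - forecast
    if error > 0:
        # Forecast was too low: demand was under-forecast.
        return under_weight * error
    # Forecast was too high (or exact): demand was over-forecast.
    return over_weight * (-error)


# Same absolute error of 10 kW, very different consequences.
print(asymmetric_cost(actual=100.0, forecast=90.0))   # under-forecast
print(asymmetric_cost(actual=100.0, forecast=110.0))  # over-forecast
```

A symmetric accuracy metric treats these two forecasts as equally good, which is exactly the mismatch that a task-specific objective is meant to address.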

This is where Part 2 picks up! In the next post, we’ll discuss how to fine-tune Chronos-2.

You can find the full notebook here:


References

[1] Chronos-2: From Univariate to Universal Forecasting, arXiv, 2025.

[2] Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis, arXiv, 2026.
