Time-Series LLMs, Explained with t0-alpha | Towards Data Science

15 minutes read


way to understand the new time-series foundation models, so I picked a recent one I could run. t0-alpha is a 102M-parameter probabilistic forecaster from The Forecasting Company, released in June 2026. The Forecasting Company published the weights openly under Apache-2.0, which is what makes this reproduction possible: the model is small enough to run on accessible hardware, and it ships with GIFT-Eval results that can be checked outside the original lab.

The model shows the basic recipe behind many current time-series LLMs. It cuts a numerical sequence into patches, processes those patches with a causal transformer, and emits quantiles rather than a
single future line. That is close enough to language modelling to make the analogy useful, but different enough that the details matter.

I also re-ran the benchmark. On GIFT-Eval, t0-alpha reproduced its reported headline numbers exactly: CRPS 0.4941 and MASE 0.7240.

Figure 1 — Accuracy versus size on GIFT-Eval. CRPS on GIFT-Eval plotted against parameter count. t0-alpha sits in the clean competitive cluster at 102M parameters, although TiRex is both smaller and slightly more accurate. Hollow markers indicate models GIFT-Eval flags for test-data leakage. Lower CRPS is better; the vertical axis is inverted so better models appear higher.

This post uses t0-alpha to explain how time-series foundation models work, how they are evaluated, where they beat classical baselines, where they still fail, and why the next useful gains may come from calibration, routing, leakage control, stronger baselines, and domain-specific estimators rather than another small transformer variation. All images in this article are self-created by the author.

How the model turns a time series into something a transformer can read

A language model starts with tokens. A time-series foundation model has to make tokens out of numbers.

t0-alpha does this by cutting the input into fixed windows of 32 time steps. Each window becomes a patch. The model embeds those patches, passes them through a decoder-style transformer, and predicts future quantiles.
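The patching step can be sketched in a few lines of NumPy. This is a minimal illustration, not t0-alpha's actual preprocessing: the function name `to_patches` and the choice to left-pad short histories with NaN are my assumptions. Only the patch length of 32 comes from the model description.

```python
import numpy as np

PATCH_LEN = 32  # patch length stated for t0-alpha


def to_patches(series, patch_len=PATCH_LEN):
    """Split a 1-D series into non-overlapping patches of patch_len steps.

    Assumption (not taken from the model code): the history is left-padded
    with NaN so the most recent observations always fill the last patch.
    """
    values = np.asarray(series, dtype=float)
    pad = (-len(values)) % patch_len  # steps needed to reach a multiple of patch_len
    padded = np.concatenate([np.full(pad, np.nan), values])
    return padded.reshape(-1, patch_len)


patches = to_patches(np.arange(70))
print(patches.shape)  # one row per patch, one column per time step
```

Each row of `patches` plays the role a token plays in a language model: it is the unit that gets embedded and passed to the transformer.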

The causal part matters. When t0-alpha forecasts the next window, it can only attend to the past. It does not see the answer window during generation.
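Causality is usually enforced with a lower-triangular attention mask over the patch positions. The sketch below shows the generic idea only. I have not inspected how t0-alpha builds its mask, and `causal_mask` is a hypothetical helper.

```python
import numpy as np


def causal_mask(n_patches):
    """Boolean mask: entry [i, j] is True when patch i may attend to patch j."""
    return np.tril(np.ones((n_patches, n_patches), dtype=bool))


mask = causal_mask(4)
print(mask.astype(int))  # ones on and below the diagonal, zeros above it
```

Row `i` is the view from patch `i`: it can see itself and everything earlier, and nothing later.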

The quantile part matters too. The model is not just drawing one expected future line. It emits a set of quantiles, which represent a forecast distribution. In my run I used nine quantile levels, from
0.1 to 0.9.

That is why CRPS is a useful metric here. It rewards a model for being accurate and for putting the right amount of uncertainty around the forecast. A narrow forecast that misses badly is punished. A wide forecast that avoids being wrong by saying very little is also punished.
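A common way to approximate CRPS from a finite set of quantiles is to average the pinball (quantile) loss across levels. The sketch below uses that approximation with toy forecasts. It is not GIFT-Eval's scoring code, and the helper `symmetric_quantiles` is a hypothetical stand-in for a model's output.

```python
import numpy as np

LEVELS = np.linspace(0.1, 0.9, 9)  # nine quantile levels, 0.1 to 0.9


def pinball_loss(y, q_pred, level):
    """Quantile (pinball) loss for one observation and one quantile level."""
    diff = y - q_pred
    return max(level * diff, (level - 1.0) * diff)


def mean_quantile_loss(y, quantile_preds, levels=LEVELS):
    """Average pinball loss across levels, a discrete stand-in for CRPS."""
    losses = [pinball_loss(y, q, l) for q, l in zip(quantile_preds, levels)]
    return float(np.mean(losses))


def symmetric_quantiles(centre, half_width, levels=LEVELS):
    """Toy forecast: the 0.1 and 0.9 quantiles sit at centre -/+ half_width."""
    return centre + half_width * (levels - 0.5) / 0.4


truth = 10.0
sharp_and_right = mean_quantile_loss(truth, symmetric_quantiles(10.0, 1.0))
narrow_but_wrong = mean_quantile_loss(truth, symmetric_quantiles(20.0, 1.0))
wide_and_vague = mean_quantile_loss(truth, symmetric_quantiles(10.0, 40.0))
print(sharp_and_right, narrow_but_wrong, wide_and_vague)
```

The sharp, well-centred forecast gets the lowest loss. Both the confident miss and the very wide forecast score much worse, which is the behaviour described above.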

The exact t0-alpha configuration I evaluated was:

The most important practical property is that t0-alpha is probabilistic. Forecasting is rarely only about the middle line. A useful model should also say when it is unsure.

The second practical property is openness. The weights are available on Hugging Face under Apache-2.0, and the package installs with:

pip install tfc-t0

At 102M parameters, t0-alpha is small by foundation-model standards. It is also small enough to run on a single mid-range GPU. That combination makes it useful for a tutorial: open weights, manageable model size, and benchmark numbers that can be reproduced outside the original lab.

**Figure 2** — t0-alpha architecture. t0-alpha is a decoder-style patch transformer for probabilistic time-series forecasting. Raw series are split into 32-step patches, embedded, processed through causal time-attention and group-attention layers, and decoded into future quantiles rather than a single point forecast.

Two kinds of time-series LLM

The phrase “time-series LLM” gets used for two different things.

The first kind is trained natively on time-series data. These models turn numerical sequences into patches or tokens, train a transformer on many forecasting datasets, and produce forecasts directly. t0-alpha, TimesFM, Toto, Chronos, TiRex and Moirai are broadly in this group, although their architectures differ.

The second kind starts with a pretrained text LLM and adapts it to forecasting. These systems reprogram, prompt, or wrap a language model so that it can process numerical sequences. Time-LLM is a
representative example of this direction.

This article is about the first kind.

The models in that first group do not all work the same way. t0-alpha, TimesFM and Toto are causal patch-based forecasters. Chronos quantises continuous values into a discrete vocabulary and uses a T5-style encoder-decoder. Moirai uses a masked encoder and is designed for arbitrary numbers of variates.

The shared recipe is now fairly stable: transformer backbone, broad time-series pretraining, probabilistic output, and zero-shot evaluation across many domains. t0-alpha is useful because it is a clean example of that recipe.

GIFT-Eval benchmark setup

I use GIFT-Eval as the main benchmark in this article. It has 97 task configurations from 55 datasets across seven domains. It includes short and long horizons, frequencies from secondly to yearly,
univariate and multivariate series, and probabilistic scoring.

It also includes several datasets that appear repeatedly in forecasting work, including M4, ETT, and a large slice of the Monash archive. That breadth makes it a useful benchmark for comparing
general-purpose forecasting models.

The two headline metrics are MASE and CRPS.

Both scores are normalised against Seasonal Naive, the baseline that repeats the previous season. A score of 1.000 means the model matches that baseline. Scores below 1.000 are better, and scores above 1.000 are worse.
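To make the baseline concrete, here is a minimal sketch of a Seasonal Naive forecast and of the normalisation step. The function names and the toy weekly series are illustrative assumptions; GIFT-Eval's own implementation may differ in details such as how the season length is chosen.

```python
import numpy as np


def seasonal_naive(history, horizon, season_length):
    """Repeat the last full season of the history to cover the horizon."""
    history = np.asarray(history, dtype=float)
    last_season = history[-season_length:]
    reps = -(-horizon // season_length)  # ceiling division
    return np.tile(last_season, reps)[:horizon]


def normalised(model_error, baseline_error):
    """Model error relative to Seasonal Naive; 1.0 means a tie, lower is better."""
    return model_error / baseline_error


history = np.tile([10.0, 12.0, 14.0, 13.0, 11.0, 5.0, 4.0], 3)  # three toy weeks
forecast = seasonal_naive(history, horizon=10, season_length=7)
print(forecast)
print(normalised(model_error=0.8, baseline_error=1.6))
```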

In my run, t0-alpha scored 0.4941 CRPS. Under GIFT-Eval’s geometric-mean normalisation, that puts its probabilistic error at about half the Seasonal Naive baseline.
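The aggregation itself is simple. The sketch below computes a geometric mean over per-task normalised scores; the four scores are made-up numbers for illustration, not GIFT-Eval results.

```python
import math


def geometric_mean(scores):
    """Geometric mean of positive per-task normalised scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


task_scores = [0.40, 0.50, 0.60, 2.00]  # illustrative values only
aggregate = geometric_mean(task_scores)
arithmetic = sum(task_scores) / len(task_scores)
print(aggregate, arithmetic)
```

The geometric mean treats scores as ratios, so halving the baseline error on one task and doubling it on another cancel out. That also makes it less dominated by a single bad task than the arithmetic mean would be.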

The aggregate score gives a broad comparison across the benchmark. The task-level results are still needed to see where the model is strong, where it is weak, and where a practitioner should run their own evaluation.

Re-running the example

I re-ran four foundation models end to end on one AWS g5.xlarge instance with an NVIDIA A10G 24GB GPU:

This is not a hardware benchmark. Every model ran on that same instance only to keep the runs as consistent as possible.

The t0-alpha run produced CRPS 0.4941 and MASE 0.7240, matching the reported figures to four decimal places.

I also checked the scoring harness with Seasonal Naive. Seasonal Naive is deterministic and has no learned parameters, so it is a useful way to test the metric, normalisation, and aggregation code. My run reproduced the official GIFT-Eval reference almost exactly on MASE and closely on CRPS across the tasks I covered.

This gives me confidence in the scoring setup. It does not mean I independently regenerated every result in the comparison table. For models I did not run cleanly on the same hardware, I use the
official GIFT-Eval reference numbers and mark them as such.

Moirai-base, for example, did not fit comfortably on the 24GB GPU for one native-multivariate dataset. I therefore use the official GIFT-Eval number for that row. The classical baselines and several
leaderboard rows are also official references rather than my own reruns.

Run-to-run variation near the top of the leaderboard

The gaps near the top of the leaderboard are small, so I checked how much variation can appear even when running the same model.

When I re-ran Chronos-Bolt-base on my machine and compared it task by task with the official reference for the same model and code, per-task CRPS differed by up to 0.068. The mean difference was 0.0055.

That matters because the gap between t0-alpha and chronos-2 in the table below is about 0.009. I would not defend the fine ordering of the leading models too strongly. The safer reading is a competitive
cluster rather than a precise rank.
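The comparison behind that caution is a per-task difference between a rerun and the reference. A minimal sketch, with hypothetical task names and made-up scores, looks like this:

```python
def rerun_noise(rerun, reference):
    """Mean and max absolute per-task difference between two score tables."""
    diffs = [abs(rerun[task] - reference[task]) for task in reference]
    return sum(diffs) / len(diffs), max(diffs)


# Illustrative numbers only; task names are placeholders.
reference = {"task_a": 0.50, "task_b": 0.60, "task_c": 0.42}
rerun = {"task_a": 0.50, "task_b": 0.62, "task_c": 0.41}

mean_diff, max_diff = rerun_noise(rerun, reference)
print(mean_diff, max_diff)
```

If the mean difference between two runs of the same model is of the same order as the aggregate gap between two different models, the ordering of those two models should be read loosely.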

Where t0-alpha sits

All figures below are geometric-mean, Seasonal-Naive-normalised scores. Lower is better.

A few things stand out.

t0-alpha beats every classical baseline in this benchmark. It also beats two strictly larger foundation models in this table: timesfm-2.0-500m at roughly five times its size, and timesfm-1.0 at roughly twice its size. Both of those larger models are leakage-flagged in GIFT-Eval.

t0-alpha also sits inside a tight clean cluster:

The spread from 0.481 to 0.496 is only 0.015 CRPS. Given the run-to-run variation above, I would not read this as a stable ranking. These models are close.

t0-alpha is not the best accuracy-per-parameter model here. TiRex is only 35M parameters and scores slightly better. The accuracy gap is small enough that I would not overread it, but the size
difference is real.

t0-alpha is a small, open, reproducible model that sits in the competitive cluster, although smaller clean models can match or beat it.

**Figure 3 — GIFT-Eval CRPS standings.** Models sorted by normalised CRPS, where lower is better. t0-alpha beats every classical baseline and two larger foundation models, but sits just behind a tight clean
cluster of recent time-series foundation models.

Where t0-alpha is useful

t0-alpha rarely collapses. Across the 97 tasks, it loses to Seasonal Naive on exactly one. On 96 of 97 configurations, it beats the standard seasonal-repeat baseline.

That consistency matters in practice. Many deployed forecasting systems are judged by how often they produce embarrassing failures when pointed at a new series. t0-alpha’s aggregate score is useful, but its broad consistency across tasks is at least as relevant.

One concrete example is daily US births.

This is the kind of result that makes foundation models attractive. There is no dataset-specific feature engineering and no per-series tuning, but the model still produces a strong probabilistic
forecast.

Where t0-alpha is weak

The weak spots are specific rather than broad. The worst normalised CRPS values come from long-horizon multivariate IT-observability data and two M4 frequencies.

bizitobs_application/long is the one task where t0-alpha is worse than Seasonal Naive.

The practical lesson is straightforward. If your workload looks like long-horizon multivariate observability data, do not assume t0-alpha’s aggregate score will transfer. Evaluate on your own series.

**Figure 4** — Where t0-alpha is weakest. t0-alpha’s weakest cases are concentrated in long-horizon IT-observability tasks and selected M4 frequencies rather than spread evenly across the benchmark. The one
clear failure against Seasonal Naive is bizitobs_application/long.

How strong tuned classical baselines are

The standings can make classical forecasting look weaker than it is.

ARIMA beats Seasonal Naive on 55 of 97 tasks. Its aggregate score looks mediocre because it has a few severe failures, not because it is weak everywhere.

Split by horizon, the pattern is clearer.

ARIMA beats the baseline at short horizons, matches it around medium horizons, and stays reasonably close at long horizons. Theta degrades more sharply as the horizon grows.

The largest failures mostly come from high-frequency data. GIFT-Eval runs the classical models in automatic default mode, one fit per series, with limited seasonal specification. A ten-second or hourly
series may have several seasonal cycles, but a default model may only be told about one of them.

That makes the classical baseline weaker than what a practitioner would normally deploy.

A tuned classical check

To test that, I re-ran a tuned classical model using MSTL with sensible multi-seasonal periods. The point was to check whether the default classical baselines were underpowered.
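Choosing multi-seasonal periods is mostly arithmetic on the sampling interval. The sketch below is one way to enumerate candidates, not the exact procedure I used: the calendar cycles considered and the rule of requiring at least two full cycles in the data are assumptions.

```python
SECONDS = {"minute": 60, "hour": 3600, "day": 86400, "week": 604800}


def candidate_periods(step_seconds, n_obs, min_cycles=2):
    """Seasonal periods, in steps, that the data covers at least min_cycles times."""
    periods = {}
    for name, secs in SECONDS.items():
        if secs % step_seconds != 0:
            continue  # the cycle is not a whole number of steps
        period = secs // step_seconds
        if period > 1 and n_obs >= min_cycles * period:
            periods[name] = period
    return periods


# Three days of a ten-second series, and three weeks of an hourly series.
print(candidate_periods(step_seconds=10, n_obs=3 * 8640))
print(candidate_periods(step_seconds=3600, n_obs=24 * 21))
```

The resulting periods can then be passed to whichever multi-seasonal model you use. A default fit that only knows about one of them is working with an incomplete description of the series.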

The first dataset back was the most useful one: bizitobs_application/10S, a high-frequency IT-observability series where default ARIMA collapses and where t0-alpha is also weak.

On this dataset, the classical-method collapse was mostly a missing seasonal-period problem. Once MSTL is given the right seasonal structure, it beats t0-alpha at every horizon.

That single dataset is not representative. Across eleven high-frequency tasks, tuned MSTL narrows the gap but does not remove it.

Tuning roughly halves the classical gap. Default ARIMA’s 1.299 becomes tuned MSTL’s 0.571, which is a large improvement. t0-alpha still wins on aggregate and wins on eight of the eleven high-frequency
tasks.

The low-frequency control tasks point the other way. On four daily and monthly series, tuned MSTL slightly beats t0-alpha on aggregate, 0.415 to 0.468 normalised CRPS. The sample is small, and the margin is sensitive to individual datasets, so I would treat that as a pilot rather than a settled result.

The practical guidance is:

For clean daily or monthly data, a well-specified classical model remains hard to beat. For heterogeneous high-frequency data, the foundation model earns its keep. For production, I would test both and keep the option to route between them.
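As a starting point, that guidance can be written as a very small routing rule. This is a hypothetical sketch: the function name, the labels, and the daily threshold are my assumptions, and the rule of thumb is only a fallback for when no validation scores exist.

```python
SECONDS_PER_DAY = 86400


def choose_forecaster(step_seconds, validation_scores=None):
    """Pick a model family for one series.

    validation_scores: optional dict of model name -> held-out error (lower is better).
    """
    if validation_scores:
        # Evidence from the series itself beats any rule of thumb.
        return min(validation_scores, key=validation_scores.get)
    # Fallback heuristic: daily or coarser data goes to the classical model.
    return "classical" if step_seconds >= SECONDS_PER_DAY else "foundation"


print(choose_forecaster(SECONDS_PER_DAY))
print(choose_forecaster(10))
print(choose_forecaster(10, {"classical": 0.40, "foundation": 0.50}))
```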

Reading the leakage flags

Leakage is a serious issue in time-series foundation-model evaluation. Public datasets get reused. Pretraining corpora are often broad. Once a dataset enters a model’s pretraining set, it is no longer a clean benchmark for that model.

That makes the GIFT-Eval leakage flags important. The flags in this comparison are:

The models ahead of t0-alpha in the table (STRIDE, Toto-2.0-313m, chronos-2, TiRex and TimesFM-2.5) are not leakage-flagged. The flagged models are ones t0-alpha beats, so leakage does not explain t0-alpha’s gap to the leaders: the leaders in this comparison are clean under the benchmark’s flags.

The leakage point runs the other way for two of t0-alpha’s wins. The larger TimesFM models it beats are flagged. That does not invalidate the comparison, but it means the result should be read
carefully.

t0-alpha is marked as non-leakage-flagged in GIFT-Eval. I rely on that benchmark label here; I did not independently inspect the full pretraining corpus.

**Figure 5** — Leakage flags in the comparison set. The models ahead of t0-alpha in this comparison are clean under GIFT-Eval’s leakage flags. The leakage-flagged models shown here are ones t0-alpha beats,
so leakage does not explain t0-alpha’s gap to the leading clean models.

Open problems in evaluation and system design

The modelling recipe now looks fairly stable. The open questions are increasingly about data, evaluation, calibration, and system design.

The next gains are likely to come from five places.

First, leakage control needs to become stricter. Forecasting benchmarks reuse many public datasets, and foundation-model pretraining sets can be broad. A leaderboard is only as useful as the separation between pretraining data and test data.

Second, calibration matters. A probabilistic forecast should not only place the median well. Its quantiles should mean what they say. If the 0.9 quantile is exceeded much more or much less than 10% of
the time, the model is giving the wrong uncertainty.
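A basic calibration check compares each nominal quantile level with the fraction of observations that actually fall at or below it. The sketch below uses simulated Gaussian data, so every number in it is synthetic; `empirical_coverage` is a hypothetical helper rather than part of any benchmark harness.

```python
import numpy as np


def empirical_coverage(y_true, quantile_preds):
    """Fraction of observations at or below each predicted quantile.

    quantile_preds has shape (n_obs, n_levels).
    """
    y = np.asarray(y_true, dtype=float)[:, None]
    return (y <= np.asarray(quantile_preds, dtype=float)).mean(axis=0)


rng = np.random.default_rng(0)
levels = np.linspace(0.1, 0.9, 9)
reference = rng.normal(size=20_000)  # sample used to estimate the quantiles
y_true = rng.normal(size=20_000)     # fresh observations to score against

calibrated = np.tile(np.quantile(reference, levels), (len(y_true), 1))
overconfident = 0.5 * calibrated     # intervals squeezed to half their width

print(empirical_coverage(y_true, calibrated).round(2))
print(empirical_coverage(y_true, overconfident).round(2))
```

For the calibrated forecast the coverage tracks the nominal levels. For the overconfident one, the 0.9 quantile is exceeded far more often than 10% of the time, which is exactly the failure described above.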

Third, routers and ensembles are likely to matter more than single-model rankings. The task-level results suggest complementarity across models. A system that knows when to trust each model may beat any single model.

Fourth, classical baselines need to be stronger. Default ARIMA is not the same thing as a practitioner’s tuned forecasting stack. If foundation models are compared only against weak automatic baselines, the comparison is incomplete.

Fifth, some domains may need specialist estimators. A general foundation model is useful when the deployment domain is broad or unknown. A repeated business problem with stable structure may reward a smaller model trained for that specific decision.

t0-alpha’s result is useful because it makes this shift visible. A small open model is now strong enough to sit near much larger systems. That makes evaluation quality more important, not less.

**Figure 6** — **Where evaluation and system design matter most.** The modelling recipe for time-series foundation models has largely converged around transformer backbones, large pretraining corpora, and
probabilistic heads. The open problems are now in evaluation and system design: leakage control, calibration, stronger baselines, routing, and ensembling.

Simulator-trained forecasting estimators

There is another route that sits slightly outside the current time-series foundation-model recipe.

Most recent time-series LLMs learn general forecasting behaviour from large pretraining corpora. That is the Chronos, TimesFM, Toto, Moirai, TiRex and t0-alpha direction: collect many real time series,
train a general model, and hope the representation transfers across domains.

The Simulacrum paper points to a different but complementary path. Instead of pretraining only on historical datasets, it builds a simulated forecasting world, samples many synthetic time series from that world, and trains a neural estimator to make the decision that matters.

For example, if the deployment problem is short retail demand series, simulate short retail demand series. If the problem is high-frequency observability data, simulate trend, seasonality, bursts,
missingness, outages, regime shifts and censoring. Then train the estimator for the actual objective: calibrated quantiles, inventory cost, capacity risk, model selection, bias reduction, or robust
prediction intervals.
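To make the idea concrete, here is a toy simulator for a single series with trend, one seasonal cycle, bursts, a level shift and missing values. It is my own illustration rather than the Simulacrum paper's generator, and every distribution and parameter value in it is an arbitrary assumption.

```python
import numpy as np


def simulate_series(n_steps, rng, season=24, burst_prob=0.01, missing_prob=0.02):
    """Draw one synthetic series from a small hand-built forecasting world."""
    t = np.arange(n_steps)
    trend = rng.normal(0.0, 0.01) * t                      # random linear drift
    seasonal = rng.uniform(0.5, 2.0) * np.sin(2 * np.pi * t / season)
    shift_at = rng.integers(0, n_steps)                    # regime shift location
    level_shift = np.where(t >= shift_at, rng.normal(0.0, 2.0), 0.0)
    bursts = (rng.random(n_steps) < burst_prob) * rng.exponential(5.0, n_steps)
    noise = rng.normal(0.0, 0.2, n_steps)
    y = 10.0 + trend + seasonal + level_shift + bursts + noise
    y[rng.random(n_steps) < missing_prob] = np.nan         # missing observations
    return y


series = simulate_series(2000, np.random.default_rng(42))
print(series[:5], np.isnan(series).mean())
```

Sampling many such series, each with different random parameters, gives a training set whose structure is known by construction. The estimator is then trained against whatever loss matches the decision.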

This is not a replacement for time-series foundation models. It is closer to an estimator factory. A foundation model tries to be broadly competent across many unknown domains. A simulator-trained estimator tries to be near-optimal inside a carefully designed world.

There is a risk with this, however. If the simulated world is wrong, the model can be confidently wrong for the real problem. If the simulator is close to the deployment setting, the method gives
something attractive: a small, fast, task-specific forecaster whose training objective matches the business decision directly.

For this article, the point is that the next gains may not come only from larger time-series LLMs. They may also come from hybrids: foundation models for broad zero-shot robustness, classical models for clean structured cases, routers and ensembles for complementarity, and simulator-trained neural estimators for repeated domain-specific forecasting problems.

Hybrid headroom from guards and routers

Given how rarely t0-alpha loses to Seasonal Naive, I checked two simple hybrid ideas.

Both are oracle analyses. They use the answer to choose the best model per task, so they are ceilings rather than deployable systems. Ceilings are still useful because they tell us where headroom
exists.

The naive guard barely moves the score, from 0.4941 to 0.4936. That is expected. t0-alpha loses to Seasonal Naive only once, so there is little for a guard to fix.

The router result is more interesting. An oracle that selects between t0-alpha and chronos-2 per task reaches 0.4700, better than either model alone.

That is not a deployable result because the router uses the answer. It is evidence of complementarity.
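Both ceilings are easy to compute once per-task normalised scores are available. The sketch below shows the calculation with made-up scores for two hypothetical models; it does not reproduce the numbers above.

```python
import math


def geo_mean(scores):
    """Geometric mean of positive per-task normalised scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def oracle_guard(model_scores):
    """Per task, fall back to Seasonal Naive (score 1.0) when the model is worse."""
    return [min(s, 1.0) for s in model_scores]


def oracle_router(scores_a, scores_b):
    """Per task, keep whichever of the two models scored better."""
    return [min(a, b) for a, b in zip(scores_a, scores_b)]


# Illustrative per-task scores, not benchmark results.
model_a = [0.40, 0.60, 1.10]
model_b = [0.50, 0.50, 0.90]

print(geo_mean(model_a), geo_mean(oracle_guard(model_a)))
print(geo_mean(model_b), geo_mean(oracle_router(model_a, model_b)))
```

Because both functions look at the realised scores before choosing, they can never do worse than the model they start from. That is what makes them ceilings.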

The next experiment should be a real probabilistic ensemble: average the quantile forecasts per series, or learn a non-oracle router using metadata and validation performance. The oracle number tells us that headroom exists but it does not tell us how much survives in a real system.
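The simplest version of that ensemble averages the two models' quantile forecasts level by level. A minimal sketch, assuming both forecasts share the same horizon and the same quantile levels (the arrays here are toy values):

```python
import numpy as np


def average_quantiles(*forecasts):
    """Average quantile forecasts level by level.

    Each forecast has shape (horizon, n_levels) with identical levels.
    """
    stacked = np.stack([np.asarray(f, dtype=float) for f in forecasts])
    return stacked.mean(axis=0)


# Toy forecasts: two horizon steps, three quantile levels each.
forecast_a = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
forecast_b = [[3.0, 4.0, 5.0], [0.0, 1.0, 2.0]]

ensemble = average_quantiles(forecast_a, forecast_b)
print(ensemble)
```

Averaging level by level keeps the quantiles in order whenever each input is already ordered, so the result is still a valid set of quantiles. Whether it is well calibrated is a separate question that needs its own check.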

**Figure 7** — **Oracle hybrid headroom.** A naive fallback barely improves t0-alpha because the model almost never loses to Seasonal Naive. Oracle routing between t0-alpha and complementary foundation models
gives much larger gains, suggesting that ensembling and learned routing are the more promising next experiments.

Summary: where t0-alpha lands

t0-alpha is a useful marker for where the field now is. A 102M open model can reproduce its reported benchmark numbers, beat the classical baselines in GIFT-Eval, and sit close to several recent
time-series foundation models on aggregate CRPS.

It is not alone in that cluster. TiRex is smaller and slightly ahead in this comparison, while Toto, chronos-2 and TimesFM-2.5 also score slightly better than t0-alpha on aggregate CRPS. The gaps are small enough
that I would not read the table as a precise ordering. I would read it as evidence that several clean, modern time-series foundation models are now operating in roughly the same performance band.

Small open time-series foundation models are now strong enough to be serious baselines. t0-alpha beats every classical baseline in the GIFT-Eval table and loses to Seasonal Naive on only one of 97 task
configurations.

That does not make classical forecasting obsolete. For clean daily or monthly data, a well-specified classical model can still be hard to beat. For heterogeneous high-frequency data, a foundation model
often earns its keep. For repeated domain-specific problems, a simulator-trained estimator may be the better route. For production systems, the useful answer is likely to be a router or ensemble rather than one universal forecasting model.

The next experiments I would run are straightforward:

First, build a real quantile ensemble of t0-alpha and chronos-2. The oracle router suggests complementarity, but the useful test is how much survives without using the answer to choose the model.
Second, run a leakage-controlled, time-forward evaluation with stronger classical baselines. The transformer recipe is no longer the only interesting part of the problem; evaluation quality matters just as much.
Third, test a simulator-trained specialist estimator on one weak segment, such as high-frequency observability data. If the simulator captures the right structure, including multiple seasonalities, bursts, outages, censoring and regime shifts, the comparison becomes more interesting than foundation model versus classical model.

In practice I would combine four things: foundation models for broad zero-shot competence, classical models for clean structured cases, learned routers for complementarity, and specialist estimators where the deployment world is repeated enough to simulate.

Disclaimer: The views and opinions expressed in this article are my own and do not represent those of my employer or any affiliated organizations. The content is based on personal experience and reflection, and should not be taken as professional or academic advice.

📚References

The Forecasting Company. (2026). t0-alpha. Hugging Face Model Card. Describes t0-alpha as an open-weights transformer-based probabilistic multi-horizon forecasting model and provides the model weights, usage instructions, and Apache-2.0 licensing context.
Aksu, T., Woo, G., Liu, J., Liu, X., Liu, C., Savarese, S., Xiong, C., and Sahoo, D. (2024). GIFT-Eval: A Benchmark for General Time Series Forecasting Model Evaluation. Introduced GIFT-Eval, the general-purpose zero-shot time-series forecasting benchmark used in this article, including its multi-domain task suite, normalised scoring protocol, and leakage-aware evaluation setup.
Das, A., Kong, W., Sen, R., and Zhou, Y. (2024). A Decoder-Only Foundation Model for Time-Series Forecasting. International Conference on Machine Learning (ICML). Introduced TimesFM, a patch-based decoder-only time-series foundation model and one of the main comparator families in the GIFT-Eval standings.
Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Rangapuram, S. S., Arango, S. P., Kapoor, S., Zschiegner, J., Maddix, D. C., Wang, H., Mahoney, M. W., Torkkola, K., Wilson, A. G., Bohlke-Schneider, M., and Wang, Y. (2024). Chronos: Learning the Language of Time Series. Introduced Chronos, which converts continuous time-series values into discrete tokens and trains language-model architectures for probabilistic forecasting.
Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y., Shi, X., Chen, P.-Y., Liang, Y., Li, Y.-F., Pan, S., and Wen, Q. (2024). Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. International Conference on Learning Representations (ICLR). Shows the second family of “time-series LLM” work, where pretrained text LLMs are reprogrammed for forecasting rather than trained natively on time-series patches or tokens.
