Why I Don't Trust LLMs to Determine When the Weather Changes

Weather apps have a simple problem: they show you the forecast, but they don't tell you when it actually changed.
That may sound trivial. It isn't.
Modern numerical weather prediction (NWP) systems – such as the ECMWF IFS – produce remarkably accurate forecasts at ~9 km resolution, updated every few hours. The data is already very good.
The problem is not the forecast.
The problem is attention: knowing when a change in that data actually matters.
I didn't learn that in software engineering. I learned it years ago, studying chaos theory at Instituto Balseiro. It was there, working with dynamical systems, that I first encountered an uncomfortable idea:
A system can be completely deterministic and still be unpredictable.
That thought stayed with me. And years later, when I started building AI systems, I realized that many of them were ignoring it.
The problem with “vibe-based” deltas
When I first saw how engineers were building weather agents, I noticed a pattern:
- Download forecast data
- Pass it to an LLM
- Ask: “Has the weather changed a lot?”
At first glance, this seems reasonable. From a physics point of view, it is problematic – at least for problems where the decision boundary is already well defined – because it replaces a well-defined boundary with a probabilistic one.
In a chaotic system, significance is not a linguistic judgment – it is a threshold over variables such as temperature, precipitation, or wind speed, and it depends on magnitude, context, and time horizon.
An LLM is a stochastic process. It is great at generating language, but it is not designed to enforce hard thresholds on physical systems.
When you ask an LLM whether the forecast has “changed significantly,” you are asking a probabilistic model to approximate a deterministic rule you could have written down explicitly. That introduces variance exactly where you want determinism.
The failure modes are subtle:
- Decisions based on phrasing rather than data
- Inconsistent outputs for identical inputs
- Results that cannot be tested or reproduced
In most systems, that may be acceptable. In agriculture, energy, and transportation – where a 3°C drop can mean crop damage, a spike in energy demand, or an operational disruption – it is not. These decisions need to be stable and well defined.
Which led me to a simple rule:
If you can write the decision as an assertion, you probably shouldn't delegate it to an LLM.
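In code, the distinction is literal. A minimal illustration – the threshold value here is mine, not Skygent's:

# A decision you can write as an assertion needs no LLM.
RAIN_DELTA_THRESHOLD_PP = 20.0  # illustrative threshold, in percentage points

def forecast_changed_significantly(prev_prob: float, curr_prob: float) -> bool:
    """Deterministic: the same inputs always give the same answer."""
    return abs(curr_prob - prev_prob) > RAIN_DELTA_THRESHOLD_PP

assert forecast_changed_significantly(10.0, 50.0) is True
assert forecast_changed_significantly(10.0, 25.0) is False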
My approach to this problem
My career does not look like a straight line – more like a trajectory in phase space. A Marie Curie PhD in Climate Dynamics, five years directing R&D at the national meteorology center of Uruguay – forest fire prevention, seasonal forecasting, climate adaptation – then a switch to production ML at Microsoft and Mercado Libre.
That arc gave me something specific: I already understood the physics behind the data, the skill horizons of the models, and what “significant change” actually means in a physical system. Not as a software abstraction – as a measurable delta in variables with known uncertainty bounds.
When I started building AI systems, the instinct was immediate: this is a threshold problem. Thresholds belong in the code, not in the prompt.
Skygent is one embodiment of that idea – an agent designed not to show predictions, but to detect meaningful changes in them.
The system runs continuously on real-time forecast data for user-defined events, checking for changes every few hours and triggering alerts only when predefined conditions are met. In practice, most polling cycles produce no alert – only a small fraction of changes exceed the significance threshold. That's the point: signal, not noise.
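To make “user-defined events” concrete, a monitored event might look something like this – a hypothetical shape, not Skygent's actual schema:

# Hypothetical event definition (field names are illustrative)
event = {
    "name": "Ana's Wedding",
    "location": {"lat": -34.90, "lon": -56.16},  # Montevideo
    "check_interval_hours": 6,
    "thresholds": {
        "precipitation_probability_max": 20.0,  # percentage points
        "temperature_max": 3.0,                 # °C
    },
}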
The Architecture
Skygent follows a clean separation of concerns across five layers.
Only one of those layers touches the LLM.
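Read from the sections that follow, the pipeline looks roughly like this – the layer names are my inference, not Skygent's:

# Rough shape of the five-layer pipeline (layer names inferred)
#
#   1. Ingestion   – download forecast data on a schedule
#   2. Validation  – parse raw data into Pydantic forecast summaries
#   3. Gatekeeper  – deterministic significance checker (pure Python)
#   4. Trigger     – binary condition: threshold exceeded or not
#   5. Narrator    – GPT-4o-mini turns the structured alert into text
#
# Only layer 5 involves the LLM.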
The Deterministic Gatekeeper
The gatekeeper is a pure Python checker. It doesn't interpret – it computes. It (a sketch follows the list):
- Compares consecutive Pydantic-validated forecast summaries
- Checks deltas against configurable thresholds
- Incorporates context: event type, variable sensitivity
- Accounts for forecast horizon using known NWP skill limits – a change in the 24-hour forecast does not carry the same reliability as a change in the 10-day forecast
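A minimal sketch of what such a checker could look like – field names, thresholds, and skill bands are illustrative assumptions, not Skygent's actual code:

from pydantic import BaseModel

class ForecastSummary(BaseModel):
    precipitation_probability_max: float  # percent
    temperature_max: float                # °C
    wind_speed_max: float                 # km/h
    horizon_days: float

# Illustrative per-variable thresholds (not Skygent's defaults)
THRESHOLDS = {
    "precipitation_probability_max": 20.0,  # percentage points
    "temperature_max": 3.0,                 # °C
    "wind_speed_max": 15.0,                 # km/h
}

def confidence_for_horizon(horizon_days: float) -> str:
    """Crude NWP skill bands: shorter horizons are more reliable."""
    if horizon_days <= 3:
        return "high"
    if horizon_days <= 7:
        return "medium"
    return "low"

def check_significance(prev: ForecastSummary, curr: ForecastSummary) -> list[dict]:
    """Pure function: compares two validated summaries against thresholds.
    No LLM involved – every decision is reproducible and assertable."""
    alerts = []
    for variable, limit in THRESHOLDS.items():
        delta = getattr(curr, variable) - getattr(prev, variable)
        if abs(delta) > limit:
            alerts.append({
                "variable": variable,
                "from_value": getattr(prev, variable),
                "to_value": getattr(curr, variable),
                "delta": delta,
                "horizon_days": curr.horizon_days,
                "confidence": confidence_for_horizon(curr.horizon_days),
            })
    return alerts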
This is where decisions are made. Each alert has a traceable path: which variable changed, by how much, which threshold was exceeded. In a business or government setting, being able to explain why an alert fired – beyond “the model felt like it” – is not optional.
The Trigger
An alert fires only if a threshold is exceeded. If the delta does not exceed the threshold, nothing happens. This is a binary, testable condition – not a judgment call.
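Continuing the sketch above, the trigger reduces to a single conditional – dispatch_to_narrator is a hypothetical handoff to the LLM layer described next:

# The trigger: a binary, testable condition – not a judgment call.
def run_check(prev: ForecastSummary, curr: ForecastSummary) -> None:
    alerts = check_significance(prev, curr)
    if alerts:
        dispatch_to_narrator(alerts)  # hypothetical handoff to the narrator
    # else: nothing happens – no LLM call, no notification, no cost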
The Narrator
Only after the decision is made does the LLM enter the pipeline. Its role is strictly limited: take structured JSON and translate it into natural language.
# Structured payload sent to GPT-4o-mini
{
  "event_name": "Ana's Wedding",
  "variable": "precipitation_probability_max",
  "from_value": 10.0,
  "to_value": 50.0,
  "delta": 40.0,
  "horizon_days": 5.2,
  "confidence": "medium"
}
Output:
“Chance of rain increased from 10% to 50% in your event window. Confidence is medium due to the 5-day horizon.”
The LLM does not decide anything. It explains.
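For illustration, the narration step could look like this – a sketch using the OpenAI Python SDK, with a prompt of my own rather than Skygent's:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def narrate(alert: dict) -> str:
    """Turn one structured alert into plain language. The model only
    rephrases the fields it is given – it decides nothing."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You write one short weather alert sentence. "
                "Use only the fields in the JSON. Do not add advice."
            )},
            {"role": "user", "content": json.dumps(alert)},
        ],
    )
    return response.choices[0].message.content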
Why this architecture is testable
It is almost impossible to achieve 100% test coverage in a pure LLM agent – you cannot write deterministic assertions over its possible outputs.
The hybrid approach changes this. The decision logic is pure Python with Pydantic validation: 204 unit tests, with no LLM dependency anywhere in the test suite. The LLM carries only the narrative layer – the one thing natural language generation is genuinely good for.
This is not just a testing convenience. It means every decision the system makes can be explained, reproduced, and verified outside the LLM.
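For example, a deterministic test against the checker sketched earlier – no LLM, no network, no flakiness:

def test_rain_probability_jump_triggers_alert():
    prev = ForecastSummary(precipitation_probability_max=10.0,
                           temperature_max=22.0, wind_speed_max=15.0,
                           horizon_days=5.2)
    curr = ForecastSummary(precipitation_probability_max=50.0,
                           temperature_max=21.4, wind_speed_max=18.0,
                           horizon_days=5.2)
    alerts = check_significance(prev, curr)
    assert len(alerts) == 1
    assert alerts[0]["variable"] == "precipitation_probability_max"
    assert alerts[0]["delta"] == 40.0
    assert alerts[0]["confidence"] == "medium"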
Event-Driven LLM Invocation
The naive agent calls the LLM on every polling cycle. This one doesn't.
Skygent checks every 6 hours. It calls the model only when a threshold is exceeded – about once or twice a week per monitored event, compared to ~28 calls per week for a naive polling agent (four checks a day, seven days a week).
At gpt-4o-mini prices (~$0.0001 per call), the cost is negligible either way. More importantly, the cost is proportional to actual events: you pay for an LLM call only when something worth communicating happens.
A concrete example
Previous summary: 10% chance of rain, high temperature 22°C, wind 15 km/h
Current summary: 50% chance of rain, high temperature 21.4°C, wind 18 km/h
Threshold: alert if rain probability Δ > 20 pp
Check interval: every 6 hours
Result: alert triggered → GPT-4o-mini generates the narration → notification delivered

When this pattern breaks
This pattern does not work everywhere. It breaks down when:
- Inputs are unstructured or ambiguous
- Decision criteria cannot be encoded as parameters
- The reasoning required is open-ended
In those cases, LLM-first architectures – ReAct, Plan-and-Execute – make more sense.
One fair caveat: the thresholds in Skygent are configurable defaults – reasonable starting points informed by meteorological practice, but not calibrated against historical forecast errors for specific use cases. Calibrating against actual outcomes is a natural next step for any serious deployment. The pattern is sound; the initial parameters are provisional.
Closing thoughts
The most important decision I made in building this system was not choosing a model or a framework.
It was deciding where not to use an LLM.
There is a current tendency to delegate more and more to language models – to let them figure things out. But some problems already have structure. Some decisions are already well defined.
When they are, approximating them with language is the wrong move. Encoding them explicitly is strictly better.
In practice, this often comes down to a simple distinction: use LLMs to describe decisions, not to make well-defined ones.
Full implementation – significance checker, LangGraph pipeline, Telegram bot – available at: github.com/ferariz/skygent
Fernando Arizmendi builds production AI systems at the intersection of hard science and applied AI engineering. He is a physicist (B.Sc. & M.Sc.) from Instituto Balseiro, a former Marie Curie fellow (Ph.D. in Climate Dynamics & Complex Systems), and previously directed R&D at the national meteorology center of Uruguay.
LinkedIn · GitHub
All photos by the author. Pipeline diagram by Claude (Anthropic).



