From possible to probable AI models

Over the years, I've participated in many discussions about generative AI (and you probably have, too!). These discussions varied in focus, from conversations with the general public about the use of AI to conversations with professionals about the accuracy of models. No matter who I talk to, people are always fascinated by and curious about what models can do.
Can an LLM write a working kernel driver? It can. Can it write a song about how much you love your cat? Sure. Can a diffusion model produce a realistic picture of a medieval astronaut? It can.
But, of course, does “can” mean it will be good? It turns out that “can” is a surprisingly low bar for many models.
If you have studied probability or statistics, you probably know that in a large enough sample space, almost anything can happen. The challenge is not determining whether an outcome is possible; it's understanding how likely that outcome is and whether we can depend on it over and over again.
That's what confuses many people about probability theory, whether or not it's applied to artificial intelligence: the difference between what is possible and what is probable. That distinction is important because building a production AI system is very different from building a demo. Demos thrive on best cases. Production systems depend on expected behavior.
As AI systems become a growing and important part of workflows and decision-making, it is worth revisiting the basics from the perspective of probability and examining where common assumptions about the reliability of AI begin to break down.
1. Dimensions and possibility space
To be honest, talking about reliable systems is much easier than building them. To understand why reliability remains so difficult, it helps to take a step back and think about sample spaces. Let's start with the simplest case, a coin flip. For a single coin toss, the sample space is $\Omega = \{H, T\}$. The possible outcomes are easy to visualize because the space of possibilities is so small.
Now consider a language model that generates a sequence of 512 tokens from a vocabulary of about 50,000 tokens, giving a sample space of size $50{,}000^{512} \approx 10^{2406}$. The size of this sample space is almost impossible to comprehend, let alone visualize (in your head or in practice).
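To make the comparison concrete, here is a minimal sketch that computes the order of magnitude of that sample space. It uses the 512-token length and 50,000-token vocabulary from the paragraph above; the variable names are only illustrative.

```python
import math

vocab_size = 50_000  # approximate vocabulary size from the text
seq_len = 512        # number of tokens in the generated sequence

# The number of distinct sequences is vocab_size ** seq_len.
# Working in log10 keeps the number readable.
log10_size = seq_len * math.log10(vocab_size)

coin_outcomes = 2  # heads or tails

print(f"coin flip: {coin_outcomes} outcomes")
print(f"512-token sequence: about 10^{log10_size:.0f} outcomes")
```

For scale, the number of atoms in the observable universe is usually estimated at around 10^80.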
In such cases, where the sample space is enormous, the region corresponding to useful, consistent, and truly correct outcomes can be surprisingly small compared to the number of possible alternatives. In other words, an ocean of possible outcomes, and only a small pool of useful ones…
When the model returns an answer that is possible but not correct, we call it a hallucination. Hallucinations, then, are not a software bug. Instead, they occur because the model samples from regions of the probability distribution that have non-zero probability but little practical value.
At first glance, you might think:
“The more data we collect, the more hallucinations will disappear.”
But the challenge is that hallucinations arise naturally in probabilistic systems. Sampling from a distribution always leaves a chance of landing in areas of low probability.
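A rough sketch of why rare outcomes still show up at scale. The per-response probability `p_bad` is a made-up number, and the calculation assumes independent samples purely for simplicity (an assumption that real models do not satisfy exactly).

```python
def prob_at_least_one(p_bad: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples lands in a bad region."""
    return 1.0 - (1.0 - p_bad) ** n_samples

p_bad = 0.001  # hypothetical chance that a single response is a hallucination

for n in (1, 100, 1_000, 10_000):
    print(f"{n:>6} responses -> P(at least one bad) = {prob_at_least_one(p_bad, n):.4f}")
```

Even a one-in-a-thousand event becomes close to certain once a system serves enough requests.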
2. Frequentist estimates vs Bayesian expectations
When evaluating AI systems, there are often two very different approaches. The first is, more or less, a frequentist one: you run 1,000 benchmark tasks and measure the performance. If the model solves 850 correctly, we call it an 85% accurate system.
The second is a Bayesian view, where you start with expectations about how an intelligent system should behave and revise those beliefs when unexpected failures occur.
This distinction is important because a model's answers are rarely independent events. Suppose the model answers nine math questions correctly. Based on that, we may assume that the probability of getting question ten right is equal to its reported accuracy.
But language models are not a collection of independent Bernoulli trials. Their results depend on the prior context, the hidden representations, and the density of related examples within the training distribution.
This means that their performance is often conditional rather than static.
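The sketch below illustrates both views on invented numbers. The per-category counts are hypothetical and were chosen only so that they add up to the 850 out of 1,000 mentioned above; the uniform Beta(1, 1) prior is likewise an assumption.

```python
# Hypothetical benchmark results: (category, number correct, number attempted).
results = [
    ("arithmetic", 480, 500),
    ("word problems", 300, 350),
    ("proofs", 70, 150),
]

def beta_posterior_mean(correct: int, total: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mean of accuracy under a Beta(a, b) prior and a binomial likelihood."""
    return (a + correct) / (a + b + total)

total_correct = sum(c for _, c, _ in results)
total_attempted = sum(n for _, _, n in results)

# Frequentist headline number: one accuracy for the whole benchmark.
overall = total_correct / total_attempted
print(f"overall accuracy: {overall:.3f}")

# Conditional view: accuracy depends on what kind of question is asked.
per_category = {name: beta_posterior_mean(c, n) for name, c, n in results}
for name, mean in per_category.items():
    print(f"{name:>14}: posterior mean accuracy = {mean:.3f}")
```

The single 85% figure hides the fact that, in this made-up example, the model is right on fewer than half of the proof questions.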
3. Confidence is not the same as probability
One of the most widely used functions in machine learning is the Softmax function. We often interpret Softmax outputs as confidence scores: “If the model outputs 0.90 for ‘cat’, it is 90% certain.” But this interpretation can be misleading.
Okay, back up for a second: because of the exponential term in the Softmax function, small differences between logits can be magnified.
Therefore, the model can appear very confident not because it “knows” something, but because one logit was slightly larger than the others and the exponential function magnified the difference.
So when ChatGPT predicts the next word, what it actually does is answer the question:
“Of all the possible tokens, after Softmax, which one is the most likely?”
This creates what I think of as the “confident fool” problem: a system that confidently asserts something wrong because it has not learned to express uncertainty.
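As a quick illustration, the standard Softmax definition is $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The sketch below applies it to three made-up logits; the values are assumptions chosen only to show the magnifying effect.

```python
import math

def softmax(logits):
    """Standard softmax, shifted by the max logit for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 2.5]  # hypothetical raw scores for three candidate tokens
probs = softmax(logits)

for z, p in zip(logits, probs):
    print(f"logit {z:.1f} -> probability {p:.3f}")
```

A gap of 2 between the top two logits turns into a probability ratio of e^2, a bit more than 7 to 1. The ratio reflects the gap between the scores, not any independent measure of how often the model is right.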

4. The Law of Large Numbers and why more data doesn't automatically mean more truth
The Law of Large Numbers states that as sample sizes increase, observed averages approach their expected values. This idea often motivates the use of very large datasets to train our models. After all, if a model sees enough examples, eventually it should learn the truth, right?
At first glance, this sounds logical, especially since that's how we learn! But there is an important assumption hidden in the Law of Large Numbers: the underlying distribution must remain stable.
Human knowledge and language are not stable distributions. They are constantly changing and contain contradictions, biases, and inaccuracies. Spoken language varies from place to place. Even within the same city, people use the same language, the same expressions, and the same words in different ways.
As a result, the model does not necessarily converge to “the truth.” Instead, it converges to dominant patterns. Therefore, if a misconception appears frequently enough in the data, the model may learn it because, statistically, it becomes the most likely continuation.
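A toy sketch of that effect, using a pure frequency count as a stand-in for a model. The corpus and the labels “myth” and “fact” are invented for illustration; real training is far more complex than counting.

```python
from collections import Counter

# Hypothetical corpus: continuations observed after the same prompt.
# The misconception simply appears more often than the correct statement.
observed = ["myth"] * 70 + ["fact"] * 30

counts = Counter(observed)
total = sum(counts.values())
probs = {token: n / total for token, n in counts.items()}

# A purely statistical learner prefers whatever is most frequent.
most_likely = max(probs, key=probs.get)
print(probs, "->", most_likely)
```

Collecting more data from the same skewed source only makes the estimate of 70% more precise; it does not move it toward the truth.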
5. Stochasticity is not necessarily creativity
People often describe AI systems as “creative” when they produce surprising results. However, from a probabilistic point of view, something else might be going on.
Sampling temperature changes the probability that the model will select less likely tokens. Low-temperature samples are predictable and safe! High-temperature samples tend to be more varied and surprising, and they often carry a greater risk of hallucination.
Increasing the sampling temperature makes the probability distribution flatter, which means that low-probability results are sampled more often. What we sometimes interpret as creativity may be the model exploring very low-probability regions of the distribution.
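Here is a minimal sketch of temperature scaling, assuming the common convention of dividing the logits by the temperature before applying Softmax. The logits are made-up values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores for four candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

As the temperature rises, probability mass moves from the top token toward the tail, so the least likely token is sampled more often.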

6. From the possible to the reliable
If our goal is to build AI systems that work consistently in real environments, we need to go beyond asking whether something can happen and focus on reliability. Again, it's easier said than done. Still, some useful ways to do that include:
1- Using techniques such as Platt Scaling and Isotonic Regression to help align confidence scores with observed performance.
2- Using methods like Bayesian neural networks or Monte Carlo Dropout to help estimate the model's uncertainty.
3- Using external validation methods to enforce the structure and requirements of outputs, rather than assuming that the model will naturally follow the rules.
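As one small example of the third item, the sketch below checks a model's raw text output against an expected structure before anything downstream uses it. The schema (`answer` and `confidence` fields) and the function name `validate_output` are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema: field name -> required Python type after JSON parsing.
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def validate_output(raw: str):
    """Return the parsed object if it matches the schema, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None  # missing field or wrong type
    if not 0.0 <= data["confidence"] <= 1.0:
        return None  # confidence outside the allowed range
    return data

print(validate_output('{"answer": "42", "confidence": 0.9}'))
print(validate_output('The answer is probably 42.'))
```

A caller can then retry, fall back, or escalate when the result is `None`, instead of trusting that the model followed the format.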
Final thoughts
A few years ago, everyone was fascinated by AI programs that predicted the next word. Now we find that predicting the next word is only part of the problem.
The difficult challenge is to predict the correct word repeatedly and reliably, especially with new models appearing every day, each one impressive and full of promises of good performance. So, the next time you see an impressive AI demo, I encourage you to ask (either yourself or the person presenting the model):
“Is this what the model usually does, or is this a lucky sample?”
In a world of almost endless possibilities, almost anything is possible. Engineering, however, is rarely about what can happen. It's about what you can expect to happen again.



