Why AI Trains Its Trash (and How to Fix It)

If you've been around AI for a while, you're probably an LLM, agent, or chat user. But have you ever wondered how these tools will be trained in the near future? What if we have already used up the data we need to train them? Many researchers argue that we are running out of high-quality, human-generated training data.
New content goes up every day, that's true, but an increasing share of it is itself generated by AI. So if you keep training on public web data, you end up training on the output of your predecessors: a snake eating its own tail. Researchers call this phenomenon model collapse, where AI models learn from earlier models' mistakes until the entire system degrades into nonsense.
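The collapse dynamic is easy to demonstrate on a toy model. The sketch below is a hypothetical illustration of my own, not something from any paper: it repeatedly fits a Gaussian to samples drawn from the previous generation's fit, and the distribution's spread steadily shrinks, the statistical analogue of a model forgetting its own diversity.

```python
# Toy model-collapse demo: each "generation" is trained (fit) on
# samples produced by the previous generation, never on real data.
import random
import statistics

random.seed(42)

mu, sigma = 0.0, 1.0  # generation 0: the "real human data" distribution
for _ in range(1000):
    # Sample a small dataset from the current model...
    samples = [random.gauss(mu, sigma) for _ in range(20)]
    # ...and fit the next generation to it.
    mu, sigma = statistics.mean(samples), statistics.stdev(samples)

print(f"spread after 1000 generations: {sigma:.2e}")  # collapses toward 0
```

The estimation bias at each step is tiny, but it compounds across generations: the fitted spread drifts toward zero, and the "model" ends up able to produce only near-identical outputs.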
But what if I told you that we aren't actually running out of data? We've just been looking in the wrong place.
In this article, I will break down the important details of this brilliant paper.
The Web We Already Use and the Web That Matters
Most of us think of the web as a single source of information. In fact, there are at least two.
There's the Surface Web: the indexed, public web, like what we find on Reddit, Wikipedia, and news sites. This is what we have mined, and arguably overused, for years to train today's mainstream AI models. Then there is what we call the Deep Web, and here I am not talking about the “Dark Web” or anything illegal.
The Deep Web is everything behind a login or firewall: anything online that isn't publicly indexed. It could be your hospital's patient portal, your bank's internal dashboard, corporate document archives, private databases, and years of email sitting behind a login screen. Common, boring, but incredibly valuable data.
Many studies suggest that the Deep Web is orders of magnitude larger than the Surface Web. More importantly, it is higher-quality data. Surface web content can be noisy, full of misinformation, and heavily SEO-optimized, and it increasingly contains content designed to mislead or poison AI models. Deep web data, such as medical records, certified financial documents, or internal databases, is often cleaned, verified, and maintained by people who care about its quality.
The problem? You can probably guess: it's private. You can't just release a million medical records without running into serious legal and ethical risks.
The PROPS Framework
This is where a new framework called PROPS (Protected Pipes) comes in. Proposed by Ari Juels (Cornell Tech), Farinaz Koushanfar (UCSD), and Laurence Moroney (former Google AI lead), PROPS serves as a bridge between this sensitive data and the AI models that need it.
The genius of PROPS is that it doesn't ask you to “give away” your data. Instead, it uses privacy-preserving oracles. Think of an oracle as a trusted middleman that can look at your data, confirm it's real, and then tell the AI model what it needs to know without ever showing the model the raw information.
This can sound like magic, since it would solve many of the data availability problems AI models face today. But how does it actually work? Let's take the example of a medical company that wants to train a diagnostic tool on real health records. Under the PROPS framework:
- Consent: As a user, you log into your health portal and authorize specific uses of your data.
- Oracle: Think of the oracle as a digital notary. It goes to your private portal (such as your hospital's website) to verify that your data is genuine. Instead of copying your files, it simply tells the AI system: “I have seen the original documents, and I attest that they are authentic.” It provides proof of authenticity without exposing the private data itself. Tools for this already exist, such as DECO, a protocol that lets users prove statements about data they received from a web server over a secure TLS channel, without the server's cooperation.
- Secure Enclave: This is the “black box” inside the computer's hardware where the actual training takes place. We put the AI model and your private data inside and lock the door. No person, not even the developer, can see what is happening inside. The model trains on the data, and only the model weights ever leave; the raw data stays locked inside.
- Result: The model is trained on the data inside that box. Only the updated weights (the model's learned parameters) come out. The raw data is never exposed.
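The steps above can be sketched in a few lines of code. Everything here is illustrative: the `oracle_attest` and `enclave_train` names, the hash-based attestation, and the placeholder weight update are my assumptions, not the paper's actual interfaces.

```python
# Illustrative PROPS-style flow: consent -> oracle attestation ->
# enclave training -> only weights leave. Names and APIs are hypothetical.
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class Attestation:
    record_hash: str  # a commitment to the data, not the data itself
    source: str       # e.g. the portal that served it

def oracle_attest(raw_record: bytes, source: str, user_consents: bool):
    """Privacy-preserving oracle: verifies the record against its source
    and emits only a commitment (here, a hash) instead of the record."""
    if not user_consents:  # step 1: consent is mandatory
        return None
    return Attestation(hashlib.sha256(raw_record).hexdigest(), source)

def enclave_train(raw_record: bytes, att: Attestation, weights: list) -> list:
    """Stand-in for the secure enclave: checks that the attestation matches
    the data, updates the weights, and discards the raw record."""
    assert hashlib.sha256(raw_record).hexdigest() == att.record_hash
    update = 0.01  # placeholder for a real gradient step
    return [w + update for w in weights]  # only weights exit the enclave

record = b"patient=jane; condition=rare-x"
att = oracle_attest(record, "hospital-portal.example", user_consents=True)
weights = enclave_train(record, att, [0.0, 0.0])
print(weights)  # the raw record never appears in the output
```

The design point is the interface boundary: the model consumer only ever sees the attestation and the updated weights, never the record itself.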
The contributor knows exactly what they are agreeing to, and can be rewarded in proportion to how valuable their particular data really is. A genuinely new relationship between data owners and AI systems.
But why bother with this instead of Synthetic Data?
Some may ask: “Why bother with this complex setup when we can generate synthetic data?”
The answer is that synthetic data is a variety killer. By design, synthetic data generation concentrates on the middle of the bell curve. If you have a rare medical condition that affects only 0.01% of the population, a synthetic data generator may simply smooth your case away as “noise.”
Models trained on synthetic data become worse at handling outliers. PROPS addresses this by giving real people with rare conditions or unique backgrounds a secure way to opt in. It turns data sharing from a privacy risk into a data marketplace, where valuable data gets the compensation it deserves.
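To put a number on the 0.01% example (my own back-of-the-envelope arithmetic, not a figure from the paper): if a generator simply resamples 10,000 synthetic points, the rare case vanishes entirely about a third of the time in a single generation.

```python
# Chance that a 10,000-sample synthetic dataset contains zero
# instances of a condition with 0.01% prevalence.
p_rare = 1e-4
n_samples = 10_000

p_no_rare = (1 - p_rare) ** n_samples  # ~ e^-1, about 0.37
print(f"P(no rare case in one synthetic generation): {p_no_rare:.2f}")
```

Repeat that over several generations of training-on-synthetic, and extinction of the tail becomes near-certain.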
It's not just about training: inference matters too
Most discussions focus on training, but PROPS has an equally interesting application on the inference side.
For example, getting a loan today involves submitting a lot of documents: bank statements, pay stubs, and tax returns. In a PROPS-based system, the authors suggest using a Loan Decision Model (LDM):
- You authorize LDM to communicate directly with your bank.
- The bank verifies your balance through a privacy-preserving oracle.
- LDM makes the decision.
- The result? The lender gets a verified “yes” or “no” without ever touching your confidential documents. This removes the risk of data leakage and makes it nearly impossible to submit fake, photoshopped documents.
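A toy version of that loan flow makes the separation concrete. The `bank_oracle_attests` predicate and the balance threshold are invented for illustration; the point is that only a boolean crosses from the bank to the lender.

```python
# Oracle-verified inference: the lender's model never sees the
# balance or the statements, only the bank's yes/no attestation.
def bank_oracle_attests(balance: int, required_minimum: int) -> bool:
    """Runs on the bank's side: answers a predicate about the account
    without revealing the underlying numbers."""
    return balance >= required_minimum

def loan_decision(attested: bool) -> str:
    """The lender's 'LDM' consumes only the attestation."""
    return "approved" if attested else "denied"

attestation = bank_oracle_attests(balance=12_500, required_minimum=10_000)
print(loan_decision(attestation))  # -> approved
```

Because the lender's side only ever receives `True` or `False`, there is no document to leak and no document to forge.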
What exactly prevents this from happening in 2026?
It comes down to scale and infrastructure.
The most robust version of PROPS requires training to happen inside hardware secure enclaves (such as Intel SGX or the TEEs in NVIDIA's H100 GPUs). These work well at small scale, but making them work for the large GPU clusters that frontier LLMs require is still an open engineering problem: entire clusters would have to operate in fully encrypted synchronization.
The researchers are clear that PROPS is not a finished product yet; it is a convincing proof of concept. However, a lighter version is usable today. Even without full hardware guarantees, you can build systems that give users reasonable assurances, which is already an improvement over asking someone to email you a PDF.
My Final Thoughts
PROPS is not necessarily a “new” technology; it is a new application of existing tools. Privacy-preserving oracles have been used in the blockchain and Web3 space (e.g., Chainlink) for years. The insight here is that the same tools can solve AI's data problem.
The “data problem” is not a lack of information; it is a lack of trust. We have more than enough data to build the next generation of AI, but it's locked behind the doors of the Deep Web. The snake doesn't have to eat its own tail; it just needs to find a better garden.
👉 LinkedIn: Sabrine Bendimerad
👉 Medium:
👉 Instagram:



