
Machine Learning at Scale: Managing More than One Model in Production

Have you ever wondered how real machine learning products work in large companies or technical departments? If so, this article is for you 🙂

Before we discuss robustness at scale, feel free to read my first article on the basics of machine learning in production.

In that article, I mentioned that I have spent 10 years working as an AI engineer in industry. Early in my career, I learned that a model in a notebook is just a mathematical hypothesis: it only becomes useful when its output affects the user, the product, or the revenue.

I have already shown you what "Machine Learning in Production" looks like in a single project. But today, the conversation is about scale: managing tens, or even hundreds, of ML models simultaneously. Over the years, we have moved from the Sandbox Era to the Age of Infrastructure. Building a model is now the easy part; the real challenge is ensuring that a large portfolio of models operates reliably and safely.


1. Leaving the Sandbox: The Availability Strategy

To understand ML at scale, you first need to leave the "Sandbox" mindset behind. In a sandbox, you have static data and one model. If it drifts, you see it, you stop it, you fix it.

But when you switch to scale mode, you are no longer managing a model; you are managing a portfolio. This is where the CAP theorem (Consistency, Availability, and Partition tolerance) becomes real for you. In a single-model setup, you can try to balance the tradeoffs, but at scale, it is impossible to be perfect on all three fronts. You have to pick your battles, and more often than not, Availability becomes the priority.

Why? Because if you have 100 models in production, something is always broken. If you stopped the service every time a model drifted, your product would be offline half the time.

Since we cannot stop the service, we design models that fail "cleanly." Take a recommendation system: if its model receives corrupted data, it should not crash or display a 404 error. It should fall back to a safe default (such as showing the "10 most popular" items). The user stays happy and the service stays available, even though the result is degraded. But to do this, you need to know when to trigger that fallback. And that leads us to our biggest challenge at scale: monitoring.
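The fail-cleanly pattern above can be sketched in a few lines. This is a minimal illustration, not a production serving stack; the names (`rank_for_user`, `recommend`, `TOP_10_POPULAR`) are hypothetical, invented for this sketch:

```python
# Sketch of a "fail cleanly" recommender endpoint.
# All names here are illustrative, not a real library API.

TOP_10_POPULAR = ["item_%d" % i for i in range(10)]  # precomputed safe default

def rank_for_user(user_features):
    """Stand-in for the real model call; raises on corrupted input."""
    if not isinstance(user_features, dict) or "user_id" not in user_features:
        raise ValueError("corrupted payload")
    # ... real model inference would go here ...
    return ["item_A", "item_B", "item_C"]

def recommend(user_features):
    """Never 404: on any model failure, return the popular-items fallback."""
    try:
        return {"items": rank_for_user(user_features), "source": "model"}
    except Exception:
        # Log and degrade gracefully instead of taking the service down.
        return {"items": TOP_10_POPULAR, "source": "fallback"}

print(recommend({"user_id": 42})["source"])  # served by the model
print(recommend(None)["source"])             # corrupted input -> clean fallback
```

The key design choice is that the caller cannot tell the difference structurally: both paths return the same shape, so the service stays available either way.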


2. The Challenge of Monitoring, and Why Traditional Metrics Break Down at Scale

Hearing that at scale our system needs to fail "cleanly," you might think this is easy: just test or monitor accuracy. But at scale, accuracy is not enough, and here is why:

  • Label disagreement: In computer vision, monitoring is comparatively easy because people agree on the ground truth (it is a dog or it is not). But in a recommendation or ad-ranking model, there is no gold standard. If the user doesn't click, is the model bad? Or was the user simply in a different context?
  • The feature engineering trap: Because we can't easily measure "truth" with a simple metric, we overcompensate. We add hundreds of features to the model, hoping that "more data" will resolve the uncertainty.
  • The ceiling problem: We chase 0.1% accuracy gains without knowing whether the data is too noisy to give any more. We are chasing a ceiling we cannot see.

So let's put all of that together to understand where we're going and why it matters: because monitoring the "truth" is almost impossible at scale (blind spots everywhere), we cannot rely on simple alerts to tell us when to stop. That's why we prioritize Availability again, with fallbacks as the safety net: we assume a model will eventually fail without the metrics telling us, so we build a system that can survive those "silent" failures.
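When ground truth is unavailable, one common label-free proxy is to watch whether the live score distribution has drifted away from the training-time baseline, for example with a Population Stability Index. The sketch below is illustrative: the `psi` function and the 0.25 alert threshold are conventional choices, not something prescribed by this article:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two score samples.
    Needs no labels: it compares the live score distribution to a baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        in_bin = [x for x in sample
                  if lo + b * width <= x < lo + (b + 1) * width
                  or (b == bins - 1 and x == hi)]     # close the last bin
        return max(len(in_bin) / len(sample), 1e-4)   # avoid log(0)

    return sum((frac(expected, b) - frac(actual, b))
               * math.log(frac(expected, b) / frac(actual, b))
               for b in range(bins))

baseline = [i / 100 for i in range(100)]          # training-time scores
live_ok  = [i / 100 for i in range(100)]          # same distribution
live_bad = [0.9 + i / 1000 for i in range(100)]   # scores collapsed to a corner

print(round(psi(baseline, live_ok), 3))   # ~0.0 -> healthy
print(psi(baseline, live_bad) > 0.25)     # large PSI -> trigger the fallback
```

A check like this cannot tell you the model is *right*, but it can tell you something has changed silently, which is exactly the signal you need to trip the fallback without waiting for labels.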


3. The Engineering Wall

Even having discussed strategy and monitoring challenges, we are not yet ready to scale, because we have not yet talked about infrastructure. Scaling requires engineering skills as much as data science skills.

We can't talk about scaling without a strong, secure infrastructure. Because the models are complex, and because Availability is our first priority, we need to think seriously about the systems we set up.

For now, my sincere advice is to surround yourself with a team or people who are used to building large infrastructures. You don't need a huge cluster or a supercomputer, but you need to think about these three basics:

  • Cloud vs. on-premise: The cloud is powerful and easy to monitor, but it's expensive. Your choice comes down entirely to cost vs. control.
  • Hardware: You can't put every model on a GPU; it would break the bank. You need a tiered strategy: run your light "fallback" models on cheap CPUs, and reserve the expensive GPUs for the heavy "money-maker" models.
  • Optimization: At scale, a one-second delay in your fallback is a failure. You're not just writing Python anymore; you have to learn to compile and optimize your code for specific chips so that "failing cleanly" happens in milliseconds.
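The tiered strategy above can be sketched as a router with a latency budget. Everything here is hypothetical (`gpu_rank`, `cpu_popular`, the 50 ms deadline): a toy to show the shape of the idea, with `time.sleep` standing in for real inference cost:

```python
import time

def gpu_rank(user_id):       # stand-in for the expensive GPU model
    time.sleep(0.002)        # pretend inference cost (~2 ms)
    return ["personalized_1", "personalized_2"]

def cpu_popular(user_id):    # stand-in for the cheap CPU fallback
    return ["popular_1", "popular_2"]

def serve(user_id, deadline_ms=50):
    """Route to the heavy model, but treat 'too slow' as a failure."""
    start = time.perf_counter()
    try:
        items = gpu_rank(user_id)
        if (time.perf_counter() - start) * 1000 > deadline_ms:
            return cpu_popular(user_id)   # blew the latency budget
        return items
    except Exception:
        return cpu_popular(user_id)       # crashed -> fail cleanly

print(serve(7))                    # within budget -> personalized results
print(serve(7, deadline_ms=1))     # over budget -> popular fallback
```

In a real system the deadline would sit in the serving layer rather than application code, but the economics are the same: the expensive path earns its GPU only when it answers on time.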

4. Beware of Label Leaks

So, you planned for failure, prioritized availability, rethought your monitoring, and built the infrastructure. You probably think you are finally ready to scale. Actually, not yet. There is a problem you cannot anticipate if you have never worked in a real production environment.

Even if your engineering is good, Label Leaks can ruin your strategy and your multi-model systems.

In a single project, you may spot a leak in the notebook. But at scale, when the data comes from 50 different pipelines, leaks are almost invisible.

A churn example: Imagine you are predicting which users will cancel their subscription. Your training data has a feature called Last_Login_Date. The model looks perfect, with a 99% F1 score.

But here's what actually happened: the database team created a trigger that clears the Last_Login_Date field when the user clicks the "Cancel" button. Your model sees a null login date and concludes, "Aha! They canceled!"

In the real world, at the exact millisecond the model needs to make its prediction, before the user cancels, that field has not yet been cleared. The model was learning from the future.

This is a basic example, just to illustrate the concept. But believe me, with a complex system making real-time predictions (as often happens with IoT), this is very hard to catch. You can only avoid it if you recognize the problem early.
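The churn leak can be reproduced in a few lines of synthetic data. Field and function names here are hypothetical, chosen to mirror the story above:

```python
# Toy illustration of the churn leak (synthetic data, invented field names).
# In the training snapshot, the DB trigger has already nulled last_login_date
# for every churned user, so "is it null?" looks like a perfect model.

train = [
    {"last_login_date": None,         "churned": True},
    {"last_login_date": "2024-05-01", "churned": False},
    {"last_login_date": None,         "churned": True},
    {"last_login_date": "2024-05-03", "churned": False},
]

def leaky_model(row):
    # Learns the trigger's side effect, not user behavior.
    return row["last_login_date"] is None

train_acc = sum(leaky_model(r) == r["churned"] for r in train) / len(train)
print(train_acc)  # 1.0 -- looks perfect offline

# At prediction time the user has NOT canceled yet, so the trigger has not
# fired: the field is still populated even for users about to churn.
live_user = {"last_login_date": "2024-06-01"}   # will cancel tomorrow
print(leaky_model(live_user))  # False -- the "perfect" model misses everyone
```

The offline metric is flawless precisely because the feature encodes the label; the same model is blind the moment it runs before the event it was supposed to predict.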

My tips:

  • Monitor data latency: Don't just validate the value of the data; monitor when it was written vs. when the event actually took place.
  • The millisecond test: Always ask: "At the exact time of the prediction, does this database row contain this value yet?"

Of course, these are simple questions, but the best time to ask them is during the design phase, before you write a line of production code.
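The millisecond test can even be automated as a point-in-time filter over your feature rows. This is a minimal sketch under the assumption that each row carries a `written_at` timestamp; the names are illustrative, not from any particular feature store:

```python
from datetime import datetime

def point_in_time_safe(feature_rows, prediction_time):
    """Keep only rows that would actually have existed at prediction time."""
    return [r for r in feature_rows if r["written_at"] <= prediction_time]

rows = [
    {"name": "plan_type",       "written_at": datetime(2024, 6, 1, 9, 0)},
    {"name": "last_login_date", "written_at": datetime(2024, 6, 1, 12, 30)},  # written AFTER
]
pred_time = datetime(2024, 6, 1, 12, 0)

safe = point_in_time_safe(rows, pred_time)
print([r["name"] for r in safe])   # only 'plan_type' survives the check
```

Any feature filtered out here is a candidate leak: the model would be training on information written after the moment it is supposed to predict.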

5. Finally, the Human in the Loop

The last piece of the puzzle is accountability. At scale, our metrics are fuzzy, our infrastructure is complex, and our data is leaky, so we need a "safety net."

  • Shadow deployment: This is mandatory at scale. You release "Model B" but do not show its results to users. You let it run "in the shadows" for a week, comparing its predictions to the "truth" that eventually arrives. Only if they hold up do you promote it to live.
  • Human-in-the-loop: For high-stakes models, you need a small team watching the automation. If your system has been falling back to "Most Popular Items" for three days, someone needs to ask why the main model has not recovered yet.
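The shadow pattern above boils down to logging both models' answers while serving only one. This is a deliberately simplified sketch with made-up threshold models (`model_a`, `model_b`) and a synthetic "truth" that arrives later:

```python
import random

random.seed(0)

def model_a(x): return x > 0.5    # live model, shown to users
def model_b(x): return x > 0.4    # candidate, runs only in the shadows

shadow_log = []

def serve(x):
    live = model_a(x)
    shadow_log.append({"x": x, "a": live, "b": model_b(x)})  # log both
    return live                    # users only ever see model A

for _ in range(1000):
    serve(random.random())

# A week later, "truth" arrives; here we pretend it turns out to be x > 0.42.
def agreement(key):
    return sum(r[key] == (r["x"] > 0.42) for r in shadow_log) / len(shadow_log)

print(f"A: {agreement('a'):.2f}  B: {agreement('b'):.2f}")
# Promote B only if it beats A against the truth that arrived.
```

The important property is that Model B's mistakes during the shadow week cost nothing: the comparison happens entirely in the log, never in front of the user.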

And a quick recap before you start working with ML at scale:

  • Since we can't be perfect, we choose to stay online (Availability) and fail safely.
  • Availability is our number 1 metric, because monitoring at scale is imprecise and traditional metrics are unreliable.
  • We build the infrastructure (cloud/hardware) to make this safe failure fast.
  • We watch out for "cheating" data (leaks) that makes our fuzzy metrics look too good to be true.
  • We use shadow deployments to prove a model is safe before it touches the customer.

And remember, your scale is only as good as your safety net. Don't let your work be among the 87% of failed projects.


👉 LinkedIn: Sabrhas Bendimerad
