
We Tried 5 Ways to Impute Missing Data: The Simplest Ones Won (Kind Of)

Photo by the Author

# The setup

You are about to train a model when you notice 20% of your values are missing. Do you drop those rows? Fill them with averages? Use something fancier? The answer matters more than you might think.

If you Google it, you'll find dozens of imputation methods, from the dead-simple (just fill in the mean) to the complex (iterative machine-learning models). You might assume the fancy methods are better. KNN imputation looks at similar rows. MICE builds predictive models. They have to beat a plain average, right?

We thought so too. We were wrong.

# The Experiment

We took the Crop Recommendation dataset from the StrataScratch projects – 2,200 soil samples from 22 crop types, with features such as nitrogen levels, temperature, humidity, and rainfall. Random Forest achieves 99.6% accuracy on this dataset. It's almost suspiciously clean.

This analysis extends our Agricultural Data Analysis project, which examines the same dataset through EDA and statistical testing. Here, we ask: what happens when clean data meets a real-world problem – missing values?

Perfect for our test.

We introduced 20% missing values (completely at random, simulating sensor failure), and tested five imputation methods:
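As a rough sketch (the `inject_mcar` helper and the synthetic array below are illustrative stand-ins, not the project's actual code), 20% MCAR missingness can be injected like this:

```python
import numpy as np

rng = np.random.default_rng(42)

def inject_mcar(X, frac=0.20, rng=rng):
    """Set a fraction of entries to NaN, completely at random (MCAR)."""
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < frac  # each cell missing with prob `frac`
    X[mask] = np.nan
    return X

# Stand-in for the 2,200 samples x 7 soil/climate features.
X = rng.normal(size=(2200, 7))
X_missing = inject_mcar(X)
print(np.isnan(X_missing).mean())  # close to 0.20
```

Because every cell has the same probability of going missing regardless of its value, this is MCAR by construction.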

The five methods: Mean, Median, KNN, MICE, and Random Sampling.

Our evaluation was careful: we used 10-fold cross-validation across five random seeds (50 runs per method in total). To ensure no information leaked from the test set into the training set, the imputation models were fit on the training folds only. For the statistical tests, we applied the Bonferroni correction. We also scaled the features before KNN and MICE; without scaling, a feature ranging from 0 to 300 (rainfall) would dominate the distance calculations over one ranging from 3 to 10 (pH). Full code and reproducible results are available with the project.
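A minimal sketch of the leakage-safe setup, using scikit-learn's `Pipeline` so that each cross-validation fold fits the imputer on that fold's training split only (the toy data and target here are placeholders, not the crop dataset):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target
X[rng.random(X.shape) < 0.2] = np.nan    # 20% MCAR missingness

# Putting the imputer inside the pipeline means cross_val_score fits it
# on each fold's training split only -- no test-set leakage.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(10, shuffle=True, random_state=0))
print(scores.mean())
```

Swapping `SimpleImputer` for `KNNImputer` or `IterativeImputer` (with a scaler in front) gives the other arms of the experiment.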

Then we ran it and stared at the results.

# A surprise

Here is what we expected: KNN or MICE would win, because they're smarter. They consider the relationships between features. They use actual machine learning.

Here's what we got:


Median and Mean tied for first place. The more complicated methods came in third and fourth.

We ran a statistical test. Mean vs. Median: p = 0.7. Not even close to significant. They are effectively identical.

But here's the kicker: both significantly outperformed KNN and MICE (p < 0.001 after Bonferroni correction). The simple methods didn't just match the fancy ones. They beat them.
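The comparison procedure can be sketched as paired t-tests over the per-run accuracies, with a Bonferroni correction for the number of pairs. The numbers below are synthetic placeholders, not our measured results:

```python
from itertools import combinations
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
# Illustrative per-run accuracies (50 runs per method) -- made up for the demo.
scores = {
    "mean":   rng.normal(0.987, 0.004, 50),
    "median": rng.normal(0.987, 0.004, 50),
    "knn":    rng.normal(0.982, 0.004, 50),
    "mice":   rng.normal(0.981, 0.004, 50),
}

pairs = list(combinations(scores, 2))
results = {}
for a, b in pairs:
    t, p = ttest_rel(scores[a], scores[b])          # paired: same CV folds/seeds
    results[(a, b)] = min(p * len(pairs), 1.0)      # Bonferroni correction
    print(f"{a} vs {b}: adjusted p = {results[(a, b)]:.3g}")
```

Paired tests are the right choice here because every method sees the same 50 fold/seed combinations.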

# Wait, What?

Before you throw away your MICE code, let's dig into why this happened.

The task was prediction, and we measured accuracy: does the model still classify crops correctly after imputation? For that goal, what matters is preserving the predictive signal, not recovering the exact values.

Mean imputation does something interesting: it replaces missing values with a “neutral” value that doesn't push the model toward any class. It's boring, but it's safe. Random Forest can still find its decision boundaries.

KNN and MICE try harder: they estimate what the true value would have been. But in doing so, they can introduce noise. If the nearest neighbors aren't actually that similar, or if MICE's iterative models latch onto spurious patterns, you may be adding error rather than removing it.

The ceiling was already high. At 99.6% accuracy, this is an easy classification problem. When the signal is that strong, small imputation errors barely matter; the model can afford some noise.

Random Forest is robust. Tree-based models handle imputed data well; a linear model would struggle more with the bias imputation introduces.

So, case closed? Just use the Mean everywhere?

Not so fast.

# Plot Twist

We measured something else: correlation preservation.

Here's the thing about real data: features don't exist in isolation. They move together. In our dataset, when soils are high in Phosphorus, they tend to be high in Potassium as well (correlation of 0.74). This is not random; farmers often apply these nutrients together, and certain soil types retain both.

When you impute missing values, you can accidentally break this relationship. Mean imputation fills in “mean Potassium” regardless of how much Phosphorus appears in that row. Do that enough times, and the connection between P and K starts to fade. Your imputed data may look fine column by column, but the relationships between columns quietly degrade.

Why does this matter? If your next step is clustering, PCA, or any analysis where feature relationships are the point, you're working with corrupted data and you don't even know it.

We checked: after imputation, how much of that P↔K correlation survived?

Photo by the Author

The rankings completely flipped.

KNN preserved the correlation almost perfectly. Mean and Median destroyed about a quarter of it. And Random Sampling (which samples values independently within each column) erased the relationship almost entirely.

This makes sense. Mean imputation replaces missing values with the same number regardless of what the other features look like. If a row has high Phosphorus, the Mean doesn't care; it still fills in the average Potassium. KNN looks for similar rows, so if a row's nearest neighbors have high K, it imputes a high K value.
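To see the effect concretely, here is a small synthetic demo (the P/K values are made up to mimic the 0.74 correlation, not drawn from the dataset): mean imputation shrinks the correlation, while KNN imputation preserves most of it.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(7)
n = 2000
# Two correlated features standing in for Phosphorus and Potassium.
P = rng.normal(50, 10, n)
K = 50 + 0.74 * (P - 50) + rng.normal(0, 6.8, n)
X = np.column_stack([P, K])
true_r = np.corrcoef(X.T)[0, 1]

X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan  # 20% MCAR

rs = {}
for name, imp in [("mean", SimpleImputer(strategy="mean")),
                  ("knn", KNNImputer(n_neighbors=5))]:
    Xi = imp.fit_transform(X_miss)
    rs[name] = np.corrcoef(Xi.T)[0, 1]
    print(f"{name}: correlation {true_r:.2f} -> {rs[name]:.2f}")
```

Mean imputation drags every imputed cell to the column center, so imputed rows contribute nothing to the covariance; KNN fills each gap from rows with similar values in the observed feature, so the P↔K link survives.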

# The Trade-Off

Here's the real takeaway: there is no single best imputation method. Choose the method that fits your specific goal and context.

The accuracy rankings and the correlation-preservation rankings are nearly inverted:

Photo by the Author

(Random Sampling, at least, is consistent – it's bad at everything.)

This trade-off is not unique to our dataset; it follows from how the methods work. Mean/Median are univariate: they look at one column at a time. KNN/MICE are multivariate: they take relationships into account. Univariate methods preserve the marginal distributions but destroy correlations; multivariate methods preserve structure but can introduce prediction error and noise.

# So, What Should You Actually Do?

After doing this research and digging through the literature, here is our working guide:

Use the Mean or Median if:

  • Your goal is prediction (classification, regression)
  • You're using a robust model (Random Forest, XGBoost, neural nets)
  • Missingness is below 30%
  • You need something fast

Use KNN if:

  • You need to preserve feature relationships
  • The downstream task is clustering, PCA, or visualization
  • You want correlations to survive for exploratory analysis

Use MICE if:

  • You need valid standard errors (for statistical interpretation)
  • You report confidence intervals or p values
  • The missingness mechanism may be MAR (Missing At Random)

Avoid Random Sampling:

  • Tempting because it “preserves the distribution”
  • But it destroys the multivariate structure
  • We couldn't find a use case where it wins
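In scikit-learn terms, the four viable methods map onto built-in imputers roughly like this (Random Sampling has no built-in equivalent, which is fine, since we don't recommend it). A sketch, with a toy array standing in for your data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

imputers = {
    "mean":   SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    # Distance/model-based methods get a scaler in front, as noted earlier.
    "knn":    make_pipeline(StandardScaler(), KNNImputer(n_neighbors=5)),
    "mice":   make_pipeline(StandardScaler(), IterativeImputer(max_iter=10, random_state=0)),
}

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # 20% MCAR

for name, imp in imputers.items():
    Xi = imp.fit_transform(X)
    print(name, "->", np.isnan(Xi).sum(), "NaNs remaining")
```

Note that `IterativeImputer` (scikit-learn's MICE-style imputer) is still behind the `enable_iterative_imputer` experimental flag, hence the extra import.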

# Serious Warnings

We tested one dataset, one missingness rate (20%), one mechanism (MCAR), and one downstream model (Random Forest). Your setup may differ. The literature shows that on some datasets, MissForest and MICE do perform better. Our finding that simple methods are competitive is real, but it is not universal.

# The Bottom Line

We went into this experiment expecting to confirm that the complex imputation methods are worth their complexity. Instead, we found that for predictive accuracy, simple imputation held its own, while completely failing to preserve the relationships between features.

The lesson isn't “always use mean imputation.” It's “know what you are imputing for.”

Photo by the Author

If you just need predictions, start simple. Verify that KNN or MICE actually helps on your data. Don't assume they will.

If you need the correlation structure for downstream analysis, Mean imputation will quietly break it while handing you perfectly reasonable accuracy numbers. That's the trap.

And whatever you do, scale your features before using KNN. Trust us on this one.

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor of statistics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the job market, provides interview advice, shares data science projects, and covers all things SQL.
