Data detox: Prepare for the messy, noisy, real world


Photo by the Author
Getting started
We've all spent hours tweaking a model, only to discover that the problem wasn't the algorithm but a null value corrupting the results on line 47,832. Kaggle competitions give the impression that data arrives as clean, labeled CSVs, but in reality, that is rarely the case.
In this article, we'll use a real data project to explore four practical habits for dealing with messy, real-world datasets.
The NoBroker Data Project: Hands-on Practice With Real-World Chaos
NoBroker is an Indian property technology (proptech) company that connects property owners and tenants directly in a broker-free marketplace.

This data project is used during the recruitment process for data science positions at NoBroker.
In it, NoBroker asks you to build a predictive model that estimates how many interactions a property will receive within a given time period. We won't complete the entire project here, but we will use it to practice training on messy, real-world data.
It has three datasets:
property_data_set.csv - Contains property information such as type, location, amenities, size, rent, and other housing characteristics.
property_photos.tsv - Contains photo URLs for each property.
property_interactions.csv - Contains the timestamps of contact requests made for each property.
Clean Interview Data vs. Real Production Data: A Practical Test
Interview datasets are soft, balanced, and boring. Actual production data? It's a dumpster fire of missing values, duplicate rows, inconsistent formats, and silent errors that wait until Friday at 5 PM to break your pipeline.
Take the NoBroker property dataset, a real-world mess with 28,888 properties across three tables. At first glance, it looks fine. But dig deeper, and you'll find 11,022 missing photo URLs, corrupted JSON strings with rogue backslashes, and more.
This is the line between clean and messy. Clean data trains you to build models, but production data trains you to survive.
We will examine four training habits.

Practice #1: Handling Missing Data
Missing data isn't just annoying; it's a decision point. Drop the row? Impute a value? Flag it as unknown? The answer depends on why the data is missing and how much you can afford to lose.
The NoBroker dataset had three types of missing data. The photo_urls column was missing 11,022 values out of 28,888 rows - that's 38% of the data.
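The project's counting snippet isn't reproduced here, but a minimal sketch of the same check, using a tiny toy frame standing in for property_photos.tsv, might look like this:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real property_photos.tsv table
pics = pd.DataFrame({'photo_urls': ['[{"title": "img1"}]', np.nan, np.nan, np.nan]})

# Count how many rows have no photo URLs at all
missing = pics['photo_urls'].isnull().sum()
print(f"{missing} missing of {len(pics)} rows ({missing / len(pics):.0%})")
```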

Deleting these rows would have destroyed thousands of valid property records. Instead, the solution was to treat missing photos as zero photos and move on.
def correction(x):
    # Missing photos = 0 photos
    if x is np.nan or x == 'NaN':
        return 0
    else:
        # Strip stray backslashes and restore the missing quote before parsing
        return len(json.loads(x.replace('\\', '').replace('{title', '{"title')))

pics['photo_count'] = pics['photo_urls'].apply(correction)
For numeric columns like total_floor (23 missing) and categorical columns like building_type (38 missing), the strategy became standard: fill the numeric blanks with the mean, and the categorical blanks with the mode.
for col in x_remain_withNull.columns:
    # Numeric columns: fill with the column mean
    x_remain[col] = x_remain_withNull[col].fillna(x_remain_withNull[col].mean())

for col in x_cat_withNull.columns:
    # Categorical columns: fill with the most frequent value (mode)
    x_cat[col] = x_cat_withNull[col].fillna(x_cat_withNull[col].mode()[0])
The first rule: don't impute without questioning why.
Understand the pattern. The missing photo URLs were not random; they corresponded to properties with no photos uploaded, which is why filling with zero made sense.
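One way to probe whether missingness is random (a sketch with hypothetical toy data, not the project's actual check) is to compare a summary statistic between rows with and without the value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: is photo_urls missingness related to rent?
df = pd.DataFrame({
    'photo_urls': ['[...]', np.nan, '[...]', np.nan],
    'rent': [12000, 8000, 15000, 7500],
})

# Mean rent for rows with photos (False) vs. rows without (True)
by_missing = df.groupby(df['photo_urls'].isnull())['rent'].mean()
print(by_missing)
```

If the two groups differ sharply, the data is likely not missing at random, and naive imputation may bias the model.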
Practice #2: Spotting Outliers
An outlier is not always wrong, but it is always suspicious.
Can you imagine a property with 21 bathrooms, an 800-year-old building, or 40,000 square feet of space? Either you've found your dream property, or someone made a data entry error.
NoBroker's dataset was full of these red flags. Box plots revealed extreme values across multiple columns: property ages over 100 years, sizes exceeding 10,000 square feet. Some were legitimate luxury properties. Most were data entry errors.
df_num.plot(kind='box', subplots=True, figsize=(22,10))
plt.show()
Here is the output.

To manage this, we first write a function that removes those outliers using the interquartile range (IQR).
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    # Fences use a 2x multiplier instead of the textbook 1.5 to be less aggressive
    fence_low = q1 - 2 * iqr
    fence_high = q3 + 2 * iqr
    df_out = df_in.loc[(df_in[col_name] <= fence_high) & (df_in[col_name] >= fence_low)]
    return df_out
We then apply this function to the numeric columns.
df = dataset.copy()
for col in df_num.columns:
    if col in ['gym', 'lift', 'swimming_pool', 'request_day_within_3d', 'request_day_within_7d']:
        continue  # Skip binary and target columns
    df = remove_outlier(df, col)

print(f"Before: {dataset.shape[0]} rows")
print(f"After: {df.shape[0]} rows")
print(f"Removed: {dataset.shape[0] - df.shape[0]} rows ({((dataset.shape[0] - df.shape[0]) / dataset.shape[0] * 100):.1f}% reduction)")
Here is the output.

After removing outliers, the dataset shrank from 17,386 rows to 15,170, losing 12.7% of the data while keeping the model sane. The trade-off was worth it.
For the target variable request_day_within_3d, capping was used instead of removal. Values above 10 were capped at 10 to prevent extreme outliers from skewing the predictions. In the following code, we also compare the counts before and after.
def capping_for_3days(x):
    num = 10
    return num if x > num else x

df['request_day_within_3d_capping'] = df['request_day_within_3d'].apply(capping_for_3days)

before_count = (df['request_day_within_3d'] > 10).sum()
after_count = (df['request_day_within_3d_capping'] > 10).sum()
total_rows = len(df)
change_count = before_count - after_count
percent_change = (change_count / total_rows) * 100

print(f"Before capping (>10): {before_count}")
print(f"After capping (>10): {after_count}")
print(f"Reduced by: {change_count} ({percent_change:.2f}% of total rows affected)")
The result?

Cleaner distribution, better model performance, and less debugging time.
Practice #3: Dealing With Duplicates and Inconsistency
Duplicates are easy. Inconsistency is hard. A duplicate row disappears with df.drop_duplicates(). An inconsistent format, such as a JSON string produced by three different systems, requires a custom parser.
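The easy half of that claim, sketched on a toy frame with hypothetical columns:

```python
import pandas as pd

# Two identical listings slipped into the toy table
df = pd.DataFrame({
    'property_id': ['id_001', 'id_001', 'id_002'],
    'rent': [10000, 10000, 15000],
})

# Exact duplicate rows vanish with a single call
deduped = df.drop_duplicates()
print(len(df), '->', len(deduped))  # 3 -> 2
```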
NoBroker's data had one of the worst kinds of JSON inconsistency. The photo_urls column was supposed to contain valid JSON, but instead it was filled with malformed strings: missing quotes, stray backslashes, and random trailing characters.
text_before = pics['photo_urls'][0]
print('Before Correction:\n\n', text_before)
Here is the string before the fix.

The fix required chaining several string replacements to correct the formatting before parsing. Here is the code.
text_after = text_before.replace('\\', '').replace('{title', '{"title').replace(']"', ']').replace('],"', ']","')
parsed_json = json.loads(text_after)
Here is the output.

After those edits, the JSON actually parsed, and the photo URLs became visible. This kind of string surgery isn't pretty, but it works.
You see inconsistent formats everywhere: dates stored as strings, typos in categorical values, ID columns stored as floats.
The solution is the same pattern we used for the JSON formatting: identify the inconsistency, write a small repair function, and apply it everywhere.
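A sketch of that pattern for the three examples above (toy data and hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    'listed_on': ['01/03/2017', '05/03/2017', '15/03/2017'],  # dates stored as strings
    'city': ['bangalore', 'Bangalore ', 'BANGALORE'],          # inconsistent casing and whitespace
    'owner_id': [101.0, 102.0, 103.0],                         # IDs stored as floats
})

df['listed_on'] = pd.to_datetime(df['listed_on'], format='%d/%m/%Y')  # strings -> datetimes
df['city'] = df['city'].str.strip().str.lower()                       # normalize categories
df['owner_id'] = df['owner_id'].astype('int64')                       # floats -> integer IDs

print(df.dtypes)
```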
Practice #4: Data Type Validation and Schema Checks
Validation starts the moment you load your data. Finding out later that dates are strings or that numbers are objects is a massive waste of time.
In the NoBroker project, types were validated during CSV read-in, enforcing the correct data types up front with pandas' date-parsing parameters. Here is the code.
data = pd.read_csv('property_data_set.csv')
print(data['activation_date'].dtype)  # object: the dates were read as strings

data = pd.read_csv('property_data_set.csv',
                   parse_dates=['activation_date'],
                   infer_datetime_format=True,
                   dayfirst=True)
print(data['activation_date'].dtype)  # datetime64[ns]
Here is the output.

The same validation was applied to the contact data.
interaction = pd.read_csv('property_interactions.csv',
                          parse_dates=['request_date'],
                          infer_datetime_format=True,
                          dayfirst=True)
This is not just good practice; it was essential downstream. The project required computing the difference in days between a property's activation date and each request date.
The following calculation raises an error if the dates are still strings.
num_req['request_day'] = (num_req['request_date'] - num_req['activation_date']) / np.timedelta64(1, 'D')
Schema checks ensure that the structure does not change, but in reality, data gets hit by drift: its distribution tends to shift over time. You can simulate this by perturbing the inputs slightly and checking whether your model or its validation detects and responds to the change.
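As a minimal illustration of that idea (synthetic numbers, not the project's data), a crude drift alarm can compare the live mean against the training distribution:

```python
import numpy as np

def drifted(train, live, threshold=0.25):
    """Flag drift when the live mean shifts by more than threshold * train std."""
    shift = abs(live.mean() - train.mean()) / train.std()
    return bool(shift > threshold)

rng = np.random.default_rng(0)
train = rng.normal(loc=15000, scale=3000, size=1000)  # values seen at training time
live = rng.normal(loc=18000, scale=3000, size=1000)   # values after a simulated shift

print(drifted(train, train))  # no shift: False
print(drifted(train, live))   # the simulated shift trips the alarm: True
```

Real monitoring would use a proper statistical test per feature, but even a threshold like this catches the gross shifts that silently degrade models.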
Documenting Your Cleaning Steps
In three months, you won't remember why you capped request_day_within_3d at 10. Six months from now, a colleague will break the pipeline by removing your outlier filter. In a year, the model will hit production, and no one will understand why it suddenly fails.
Documentation is not optional. It's the difference between a reproducible pipeline and a voodoo script that runs until it doesn't.
The NoBroker project documented its changes in code comments and in structured notebook sections with headers and a table of contents.
# Assignment
# Read and Explore All Datasets
# Data Engineering
#     Handling Pics Data
#     Number of Interactions Within 3 Days
#     Number of Interactions Within 7 Days
#     Merge Data
# Exploratory Data Analysis and Processing
# Feature Engineering
#     Remove Outliers
#     One-Hot Encoding
#     MinMaxScaler
# Classical Machine Learning
#     Predicting Interactions Within 3 Days
# Deep Learning

# Try to correct the first Json
# Try to replace corrupted values then convert to json
# Function to correct corrupted json and get count of photos
Use version control. Track changes to your cleaning logic. Keep documentation in one central place. Keep a changelog of what you tried and what worked.
The goal is not perfection. The goal is clarity. If you can't explain why you made a decision, you can't defend it when the model fails.
Final thoughts
Clean data is a myth. The best data scientists don't shy away from messy data; they know how to tame it. They investigate missing values before training.
They catch outliers before they skew predictions. They check schemas before joining tables. And they write everything down so the next person doesn't have to start from scratch.
Real impact doesn't come from perfect data. It comes from the ability to handle flawed data and still build something that works.
So the next time you open a dataset and see null values, broken strings, and outliers, don't panic. What you're looking at is not a problem but an opportunity to prove your skills against real-world data.
Nate Rosidi is a data scientist and product strategist. He is also the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


