
Data detox: Prepare for the messy, noisy, real world

Photo by the Author

Getting started

We've all spent hours tweaking a model, only to find out that the problem wasn't the algorithm but a null value that falsified the results on line 47,832. Kaggle competitions give the impression that data arrives as clean, labeled CSVs, but in reality, this is rarely the case.

In this article, we'll use a real data project to explore four practical habits for dealing with messy, real-world datasets.

The NoBroker Data Project: Hands-on practice with real-world chaos

NoBroker is an Indian property technology (prop-tech) company that connects property owners and tenants directly in a broker-free marketplace.


This data project is used during the recruitment process for data science positions at NoBroker.

For this data project, NoBroker asks you to build a predictive model that estimates how many interactions a property will receive within a given time period. We won't complete the entire project here, but we will use it to practice working with dirty, real-world data.

It has three datasets:

  • property_data_set.csv
    • Contains property information such as type, location, amenities, size, rent, and other housing characteristics.
  • property_photos.tsv
    • Contains photo URLs for each property.
  • property_interactions.csv
    • Contains the timestamps of user contact requests for each property.
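To make the starting point concrete, here is a minimal loading sketch with pandas. The column names and values below are invented stand-ins, read from in-memory strings rather than the real files; the one practical point is that the .tsv file needs an explicit tab separator.

```python
import io
import pandas as pd

# Invented miniature stand-ins for the three real files
property_csv = io.StringIO("property_id,type,rent\nP1,BHK2,15000\nP2,BHK1,9000\n")
photos_tsv = io.StringIO("property_id\tphoto_urls\nP1\t[]\nP2\t\n")
interactions_csv = io.StringIO("property_id,request_date\nP1,2017-03-21 09:15:00\n")

properties = pd.read_csv(property_csv)
photos = pd.read_csv(photos_tsv, sep='\t')  # a .tsv needs an explicit separator
interactions = pd.read_csv(interactions_csv, parse_dates=['request_date'])

print(properties.shape, photos.shape, interactions.shape)
```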

Clean interview data versus real production data: A practical comparison

Interview datasets are soft, balanced, and boring. Actual production data? It's a dumpster fire of missing values, duplicate rows, inconsistent formats, and silent errors waiting until Friday at 5 PM to break your pipeline.

Take the NoBroker property dataset, a real-world mess with 28,888 properties across three tables. At first glance, it looks fine. But dig deeper, and you'll find 11,022 missing image URLs, corrupted JSON strings with rogue backslashes, and more.

This is the line between clean and messy. Clean data trains you to build models; production data trains you to survive in the field.

Let's examine four habits worth practicing.


Practice #1: Handling missing data

Missing data isn't just annoying; it's a decision point. Drop the row? Impute a value? Flag it as unknown? The answer depends on why the data is missing and how much you can afford to lose.

The NoBroker dataset had three distinct types of missing data. The photo_urls column alone was missing 11,022 values across 28,888 rows; that's 38% of the data.
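Since the counting code itself isn't reproduced here, this is a minimal sketch of how such a gap can be quantified; the pics DataFrame below is a toy stand-in for the real property_photos.tsv.

```python
import pandas as pd

# Toy stand-in for the real photo_urls column
pics = pd.DataFrame({'photo_urls': ['[{"title": "Hall"}]', None, None, '[]']})

missing = pics['photo_urls'].isna().sum()
print(f"Missing: {missing} of {len(pics)} rows ({missing / len(pics):.0%})")  # Missing: 2 of 4 rows (50%)
```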

Deleting these rows would discard important property records. Instead, the solution was to treat missing images as zero photos and move on.

def correction(x):
    if x is np.nan or x == 'NaN':
        return 0  # Missing photos = 0 photos
    else:
        # Strip stray backslashes and restore the missing quote before parsing
        return len(json.loads(x.replace('\\', '').replace('{title', '{"title')))

pics['photo_count'] = pics['photo_urls'].apply(correction)

For numeric columns like total_floor (23 missing) and categorical columns like building_type (38 missing), this strategy became a pattern: fill the numeric blanks with the mean, and the categorical blanks with the mode.

# Numeric columns: fill missing values with the column mean
for col in x_remain_withNull.columns:
    x_remain[col] = x_remain_withNull[col].fillna(x_remain_withNull[col].mean())

# Categorical columns: fill missing values with the mode
for col in x_cat_withNull.columns:
    x_cat[col] = x_cat_withNull[col].fillna(x_cat_withNull[col].mode()[0])

The first rule: don't delete data without questioning it first!

Understand the pattern. The missing image URLs were not random.
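To make the fill-with-mean / fill-with-mode pattern self-contained, here is a toy version; the two columns imitate the project's total_floor and building_type, but the values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'total_floor': [4, np.nan, 10, 2],          # numeric: fill with the mean
    'building_type': ['AP', 'IH', None, 'AP'],  # categorical: fill with the mode
})

df['total_floor'] = df['total_floor'].fillna(df['total_floor'].mean())
df['building_type'] = df['building_type'].fillna(df['building_type'].mode()[0])

print(df)
```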

Practice #2: Finding outliers

An outlier is not always wrong, but it is always suspicious.

Can you imagine a property with 21 bathrooms, an age of 800 years, or 40,000 square feet of space? Either you found your dream property, or someone made a data entry error.

NoBroker's dataset was full of these red flags. Box plots revealed extreme values across multiple columns: property ages over 100 years, sizes exceeding 10,000 square feet. Some were legitimate luxury properties. Most were data entry errors.

df_num.plot(kind='box', subplots=True, figsize=(22,10))
plt.show()

Here is the output.


The solution was to remove the outliers.

To do this, we first write a function that drops them using IQR fences.

def remove_outlier(df_in, col_name):
    # 2 * IQR fences (wider than the classic 1.5 * IQR rule)
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    fence_low = q1 - 2 * iqr
    fence_high = q3 + 2 * iqr
    df_out = df_in.loc[(df_in[col_name] <= fence_high) & (df_in[col_name] >= fence_low)]
    return df_out

We then apply this function across the numeric columns.

df = dataset.copy()
for col in df_num.columns:
    if col in ['gym', 'lift', 'swimming_pool', 'request_day_within_3d', 'request_day_within_7d']:
        continue  # Skip binary and target columns
    df = remove_outlier(df, col)
print(f"Before: {dataset.shape[0]} rows")
print(f"After: {df.shape[0]} rows")
print(f"Removed: {dataset.shape[0] - df.shape[0]} rows ({((dataset.shape[0] - df.shape[0]) / dataset.shape[0] * 100):.1f}% reduction)")

Here is the output.


After removing outliers, the dataset shrank from 17,386 rows to 15,170, losing 12.7% of the data while keeping the model honest. The trade-off was worth it.

For the target variable request_day_within_3d, capping was used instead of removal. Values above 10 were clipped to 10 to prevent extreme outliers from skewing the forecasts. In the following code, we also compare the results before and after.

def capping_for_3days(x):
    num = 10
    return num if x > num else x
df['request_day_within_3d_capping'] = df['request_day_within_3d'].apply(capping_for_3days)
before_count = (df['request_day_within_3d'] > 10).sum()
after_count = (df['request_day_within_3d_capping'] > 10).sum()
total_rows = len(df)
change_count = before_count - after_count
percent_change = (change_count / total_rows) * 100
print(f"Before capping (>10): {before_count}")
print(f"After capping (>10): {after_count}")
print(f"Reduced by: {change_count} ({percent_change:.2f}% of total rows affected)")

The result?


Cleaner distribution, better model performance, and less debugging time.

Practice #3: Dealing with duplication and inconsistency

Duplication is easy. Inconsistency is hard. A duplicated row is handled with df.drop_duplicates(). An inconsistent format, such as a JSON string produced by three different systems, requires writing a parser.
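For the easy half, here is a quick sketch of drop_duplicates on invented rows: one call removes exact duplicates, and the subset parameter dedupes on a key column instead.

```python
import pandas as pd

df = pd.DataFrame({'property_id': ['P1', 'P2', 'P1'],
                   'rent': [15000, 9000, 15000]})

deduped = df.drop_duplicates()                    # drop fully identical rows
by_id = df.drop_duplicates(subset='property_id')  # dedupe on a key, keeping the first hit

print(len(df), len(deduped), len(by_id))  # 3 2 2
```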

NoBroker's photo data had some of the worst JSON inconsistencies. The photo_urls column was supposed to contain valid JSON, but instead it was filled with malformed strings: missing quotes, stray backslash escapes, and random trailing characters.

text_before = pics['photo_urls'][0]
print('Before Correction:\n\n', text_before)

Here is the string before the fix.


The fix required chaining several string replacements to correct the formatting before parsing. Here is the code.

text_after = text_before.replace('\\', '').replace('{title', '{"title').replace(']"', ']').replace('],"', ']","')
parsed_json = json.loads(text_after)

Here is the output.


After these edits, the JSON actually parsed and was usable. String surgery like this isn't clean, but it works.
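One way to make this kind of string surgery a little safer is to wrap it in a try/except, so rows that still refuse to parse come back as None and can be counted instead of crashing the run. This is a sketch built on the same replacements, not the project's actual code.

```python
import json

def repair_and_parse(raw):
    """Apply the string fixes; return None when the result still isn't valid JSON."""
    if raw is None:
        return None
    fixed = (raw.replace('\\', '')              # drop stray backslashes
                .replace('{title', '{"title'))  # restore the missing opening quote
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        return None

samples = ['[{title": "Hall"}]', 'not json at all', None]
parsed = [repair_and_parse(s) for s in samples]
print(sum(p is not None for p in parsed), 'of', len(samples), 'parsed')  # 1 of 3 parsed
```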

You see inconsistent formats everywhere: dates stored as strings, typos in categorical values, and IDs stored as floats.

The solution is the same kind of systematic fix we applied to the JSON formatting.
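As a sketch of that systematic approach for those three cases, using invented columns and formats:

```python
import pandas as pd

df = pd.DataFrame({
    'listed_on': ['21/03/2017', '22/03/2017'],  # dates stored as strings
    'city': ['bangalore', ' Bangalore'],        # casing/whitespace inconsistencies
    'owner_id': [1001.0, 1002.0],               # IDs that arrived as floats
})

df['listed_on'] = pd.to_datetime(df['listed_on'], dayfirst=True)
df['city'] = df['city'].str.strip().str.lower()
df['owner_id'] = df['owner_id'].astype('int64')

print(df.dtypes)
```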

Practice #4: Data type validation and schema checks

It all starts when you load your data. Finding out later that dates are strings or that numbers are objects is a waste of time.

In the NoBroker project, types were validated during the CSV read-in, enforcing the correct data types up front through read_csv parameters. Here is the code.

data = pd.read_csv('property_data_set.csv')
print(data['activation_date'].dtype)  # object: dates came in as strings

data = pd.read_csv('property_data_set.csv',
                   parse_dates=['activation_date'],
                   infer_datetime_format=True,
                   dayfirst=True)
print(data['activation_date'].dtype)  # datetime64[ns]

Here is the output.


The same validation was applied to the interactions data.

interaction = pd.read_csv('property_interactions.csv',
    parse_dates=['request_date'], 
    infer_datetime_format=True, 
    dayfirst=True)

Not only is this good practice, it was essential downstream. The project required calculating the number of days between each property's activation date and each request date.

So the following code produces an error if dates are strings.

num_req['request_day'] = (num_req['request_date'] - num_req['activation_date']) / np.timedelta64(1, 'D')

Schema checks ensure that the structure doesn't change, but in reality the data itself will shift: its distribution tends to drift over time. You can simulate this drift by perturbing the input values slightly and checking whether your model or your validation catches the change.
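As a rough illustration of that idea, here is a toy mean-shift check, not a production drift detector; the 15% simulated shift and the 10% alarm threshold are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(loc=15000, scale=3000, size=1000)  # feature seen at training time
live = train * 1.15                                   # simulate a 15% upward drift

# Crude drift alarm: flag when the live mean strays more than 10% from training
shift = abs(live.mean() - train.mean()) / train.mean()
print(f"Mean shift: {shift:.1%}", '-> DRIFT' if shift > 0.10 else '-> OK')
```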

Documenting your cleaning steps

In three months, you won't remember why you capped request_day_within_3d at 10. Six months from now, a co-worker will break the pipeline by removing your outlier filter. In a year, the model will hit production, and no one will understand why it suddenly failed.

Documentation is not optional. It's the difference between a reproducible pipeline and a voodoo script that runs until it doesn't.

The NoBroker project documented all the changes in code comments and in structured notebook sections with definitions and a table of contents.

# Assignment
# Read and Explore All Datasets
# Data Engineering
    Handling Pics Data
    Number of Interactions Within 3 Days
    Number of Interactions Within 7 Days
    Merge Data
# Exploratory Data Analysis and Processing
# Feature Engineering
    Remove Outliers
    One-Hot Encoding
    MinMaxScaler
    Classical Machine Learning
    Predicting Interactions Within 3 Days
    Deep Learning
# Try to correct the first Json
# Try to replace corrupted values then convert to json
# Function to correct corrupted json and get count of photos

Use version control. Track changes to your cleaning logic. Keep the documentation central. Keep a changelog of what you tried and what worked.

The goal is not perfection. The goal is clarity. If you can't explain why you made a decision, you can't defend it when the model fails.

Final thoughts

Clean data is a myth. The best data scientists don't shy away from messy data; they know how to handle it. They deal with missing values before training.

They identify outliers before they skew predictions. They check schemas before joining tables. And they write everything down so the next person doesn't have to start from scratch.

Real impact doesn't come from perfect data. It comes from the ability to take flawed data and still build something that works.

So when you face a dataset and see null values, broken strings, and outliers, don't be afraid. What you're looking at isn't a problem but an opportunity to prove your skills against real-world data.

Nate Rosidi is a data scientist and product strategist. He is also the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the job market, gives interview advice, shares data science projects, and covers all things SQL.
