I built a data cleaning pipeline using a messy DoorDash dataset


Getting started
According to the CrowdFlower survey, data scientists spend 60% of their time organizing and cleaning data.
In this article, we will walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains approximately 200,000 food delivery records, each including a number of characteristics such as delivery time, total items, and store category (e.g., American cuisine).
Predicting food delivery times with DoorDash data


DoorDash aims to accurately measure the time it takes to deliver food, from the moment a customer places an order to the moment it arrives at their door. In this data project, we are tasked with developing a model that predicts total delivery time based on historical delivery data.
However, we won't complete the whole project; that is, we won't build a predictive model. Instead, we will use the dataset provided in the project and create a data cleaning pipeline.
Our workflow consists of two major steps.

Data analysis

Let's start by loading and viewing the first few rows of data.
// Load and preview the dataset
import pandas as pd
df = pd.read_csv("historical_data.csv")
df.head()
Here is the output.

This data includes created_at and actual_delivery_time columns that capture the order's creation time and actual delivery time, which can be used to calculate the delivery duration. It also contains other features such as the store category, the total number of items, the order subtotal, and item prices, making it suitable for various kinds of data analysis. We can already see some NaN values, which we will explore further in the next step.
// Check the columns with info()
Let's check all column names with the info() method. We will use this method throughout the article to track changes in each column's non-null count; it is a good indicator of missing data and overall data health.
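Here is the code.
# Inspect column names, data types, and non-null counts
df.info()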
Here is the output.

As you can see, we have 15 columns, but their non-null counts differ. This means that some columns contain missing values, which can affect our analysis if not handled correctly. One last thing: the created_at and actual_delivery_time data types are objects; they should be datetime.
Creating a data cleaning pipeline
In this step, we build a data cleaning pipeline to prepare the data for modeling. Each section addresses a common issue, such as time formats, missing values, or incorrect data types.

// Converting the time columns to datetime
Before doing any data analysis, we need to prepare the time columns. Otherwise, the calculation we mentioned (actual_delivery_time - created_at) will go wrong.
What we are working with:
- created_at: when the order is placed
- actual_delivery_time: when the food arrives
These two columns are stored as objects, so to do the calculation properly, we have to convert them to datetime format. To do that, we can use the to_datetime() function from pandas. Here is the code.
import pandas as pd
df = pd.read_csv("historical_data.csv")
# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()
Here is the output.

As you can see from the screenshot above, created_at and actual_delivery_time are now datetime objects.
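Since both columns are now datetimes, we can also sanity-check the delivery-time calculation mentioned earlier. Here is a minimal sketch; the delivery_duration_min column name is our own choice, not part of the original dataset.
# Compute the total delivery duration in minutes from the two datetime columns
df["delivery_duration_min"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds() / 60
# Quick check: durations should be positive and broadly plausible
print(df["delivery_duration_min"].describe())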

Among the main columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That's why we're going to focus on cleaning it first.
// Data imputation with mode()
One of the messiest columns in the dataset, evident from its high number of missing values, is store_primary_category. It tells us what type of food each store serves, such as Mexican, American, or Thai. However, many rows are missing this information, which is a problem. For example, it may limit how we can aggregate or analyze the data. So how do we fix it?
We will fill these rows instead of discarding them. To do that, we will use a smarter imputation strategy.
We build a mapping from each store_id to its most frequent category, and use that mapping to fill in the missing values. Let's look at the data before doing that.

Here is the code.
import numpy as np
# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]
# Build store-level mapping to the most frequent category (fast and robust)
store_mode = (
    df.groupby("store_id")["store_primary_category"]
    .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)
# Fill missing categories using the store-level mode, then fall back to global mode
df["store_primary_category"] = (
    df["store_primary_category"]
    .fillna(df["store_id"].map(store_mode))
    .fillna(global_mode)
)
df.info()
Here is the output.

As you can see from the screenshot above, the store_primary_category column now has the maximum non-null count. But let's double-check with this code.
df["store_primary_category"].isna().sum()
Here is the output showing the number of NaN values. It is zero; we have filled them all.

And let's look at the dataset after imputation.


// Dropping the remaining NaNs
In the previous step, we fixed store_primary_category, but did you notice something? The non-null counts across the columns still don't match!
This is a clear indication that we are still dealing with missing values in parts of the data. Now, when it comes to data cleaning, we have two options:
- Fill in the missing values
- Drop them
Given that this dataset contains nearly 200,000 rows, we can afford to lose some. With smaller datasets, you would have to be more careful. In that case, it is advisable to analyze each column, determine how many values are missing, and fill them with an appropriate strategy (the median, the most frequent value, or a domain-specific default).
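For a smaller dataset, such a column-by-column fill might look like the minimal sketch below. This is illustrative only and assumes generic numeric and categorical (object) columns; it is not what we apply here, since we can afford to drop rows instead.
# Illustrative only: fill numeric columns with the median and
# categorical (object) columns with the most frequent value
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())
for col in df.select_dtypes(include="object").columns:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])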
To remove the remaining NaNs, we will use the dropna() method from the pandas library. We set inplace=True to apply the changes directly to the DataFrame without needing to reassign it. Let's look at the dataset this time.

Here is the code.
df.dropna(inplace=True)
df.info()
Here is the output.

As you can see from the screenshot above, each column now has the same number of non-null values.
Let's look at the data after all the changes.

// What can you do next?
Now that we have a clean dataset, here are a few things you can do next:
- Perform EDA to understand delivery patterns.
- Engineer new features, such as delivery duration or dasher availability, to add more detail to your analysis (see the sketch after this list).
- Analyze the relationships between variables to improve the performance of your model.
- Train different regression models and find the best one.
- Estimate delivery times with the best-performing model.
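As a small example of the feature-engineering step, here is a hedged sketch. The order_hour and order_dayofweek features rely only on created_at; the dasher-availability ratio assumes the columns total_busy_dashers and total_onshift_dashers exist in your copy of the dataset.
# Time-based features from the order creation timestamp
df["order_hour"] = df["created_at"].dt.hour            # hour of day the order was placed
df["order_dayofweek"] = df["created_at"].dt.dayofweek  # 0 = Monday, 6 = Sunday
# Dasher availability at order time (only if these columns exist in the data)
if {"total_busy_dashers", "total_onshift_dashers"}.issubset(df.columns):
    df["busy_dasher_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]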
Final thoughts
In this article, we cleaned up a real-life dataset from DoorDash by tackling common quality issues, such as correcting incorrect data types and handling missing values. We built a simple data cleaning pipeline for this data project and explored the next steps.
Real-world datasets can be messier than you think, but there are many methods and tricks for solving these problems. Thanks for reading!
Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the job market, gives interview advice, shares data science projects, and covers all things SQL.



