Creating a Regression Model to Predict Delivery Duration: A Practical Guide | By Jimin Kang | Dec, 2024

Data Preparation & Exploratory Analysis
Now that we've outlined our approach, let's take a look at our data and what kinds of features we're working with.
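The summary referenced below can be produced with pandas' standard overview methods. Here is a minimal sketch, assuming the raw deliveries file has been downloaded locally (the file path is a placeholder, not necessarily the article's actual path):

import pandas as pd

# placeholder path: substitute the actual location of the raw deliveries file
df = pd.read_csv('datasets/deliveries.csv')

print(df.shape)        # (rows, columns)
df.info()              # column dtypes and non-null counts
print(df.describe())   # summary statistics for the numeric columns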
From the above, we see our data contains ~197,000 deliveries, with a mix of numeric and categorical features. No feature is missing a large share of its values (the lowest non-null count is ~181,000), so we likely don't need to worry about dropping any feature entirely.
Let's check whether our data contains any duplicate deliveries, and whether there are any deliveries for which we can't compute a delivery time.
print(f"Number of duplicates: {df.duplicated().sum()} n")print(pd.DataFrame({'Missing Count': df[['created_at', 'actual_delivery_time']].isna().sum()}))
We see that all of the deliveries are unique. However, there are 7 deliveries missing a value for 'actual_delivery_time', meaning we won't be able to compute a delivery duration for those orders. Since there are only a handful of these, we'll remove those records from our data.
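A minimal sketch of that removal step, using dropna on the affected column:

# drop the handful of deliveries with no recorded delivery completion time
df = df.dropna(subset=['actual_delivery_time']).reset_index(drop=True)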
Now, let's create our prediction target. We want to predict the delivery duration (in seconds), which is the elapsed time between the customer placing the order ('created_at') and them receiving the delivery ('actual_delivery_time').
# convert columns to datetime
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)

# create prediction target
df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()
The last thing we'll do before splitting our data into train/test sets is check for missing values. We've already seen the non-null counts for each feature above, but let's look at the proportions to get a clearer picture.
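One simple way to compute those proportions (a sketch; the original table's formatting may differ):

# percentage of missing values per column, largest first
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.round(2))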
We see that the marketplace features ('total_onshift_dashers', 'total_busy_dashers', 'total_outstanding_orders') have the highest share of missing values (~8% missing). The feature with the second-highest share of missing data is 'store_primary_category' (~2%). All other features have <1% missing.
Since none of these features have a high proportion of missing data, we won't remove any of them. Later, we'll look at the feature distributions to help us decide how to handle the missing values in each feature.
But first, let's split our data into train/test sets. We'll proceed with an 80/20 split, and we'll write the test data to a separate file that won't be touched until we evaluate our final model.
from sklearn.model_selection import train_test_split
import os

# shuffle
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)
# split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# write test data to separate file
directory = 'datasets'
file_name = 'test_data.csv'
file_path = os.path.join(directory, file_name)
os.makedirs(directory, exist_ok=True)
test_df.to_csv(file_path, index=False)
Now, let's dive into exploring our training data. We'll define our numeric and categorical features, to make it clear which columns are being referenced in the exploration and preprocessing steps that follow.
categorical_feats = [
'market_id',
'store_id',
'store_primary_category',
'order_protocol'
]

numeric_feats = [
'total_items',
'subtotal',
'num_distinct_items',
'min_item_price',
'max_item_price',
'total_onshift_dashers',
'total_busy_dashers',
'total_outstanding_orders',
'estimated_order_place_duration',
'estimated_store_to_consumer_driving_duration'
]
Let's circle back to the categorical features with missing values ('market_id', 'store_primary_category', 'order_protocol'). Since only a small share of data is missing across those features (<3%), we'll simply impute the missing values with an 'unknown' category.
- This way, we won't have to remove data from the other features.
- Perhaps the absence of a value carries some predictive power for delivery duration, i.e. these values are not missing at random.
- Additionally, we'll add this imputation step to our preprocessing pipeline during modeling, so that we don't have to repeat this work on our test set (a sketch of that pipeline step follows the snippet below).
missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']

train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna("unknown")
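As a rough preview of that pipeline step (the transformer names here are illustrative, not the article's final setup), the same imputation could be expressed with scikit-learn so it is applied consistently to any data passed through the model:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# impute missing categorical values with a constant "unknown" label,
# mirroring the fillna above but reusable on unseen data at evaluation time
categorical_imputer = SimpleImputer(strategy='constant', fill_value='unknown')

preprocessor = ColumnTransformer(
    transformers=[('cat_impute', categorical_imputer, missing_cols_categorical)],
    remainder='passthrough'
)
# later, this can be chained with a regressor inside a Pipeline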
Let's take a look at our categorical features.
pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')
Since 'market_id' & 'order_protocol' have low cardinality, we can visualize their distributions easily. 'store_id' & 'store_primary_category', on the other hand, are high-cardinality features. We'll take a closer look at those later.
import seaborn as sns
import matplotlib.pyplot as plt

categorical_feats_subset = [
'market_id',
'order_protocol'
]
# Set up the grid
fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)
# Create barplots for each variable
for i, col in enumerate(categorical_feats_subset):
    sns.countplot(x=col, data=train_df, ax=axes[i])
    axes[i].set_title(f"Frequencies: {col}")
# Adjust layout
plt.tight_layout()
plt.show()
A couple of important things to note:
- ~70% of orders are placed at a 'market_id' of 1, 2, or 4
- <1% of orders have an 'order_protocol' of 6 or 7
Unfortunately, we don't have much additional information about these variables, such as which cities/locations each 'market_id' value corresponds to. At this point, requesting supplementary data of that kind could be a good idea, as it might help uncover patterns in delivery duration across geographic regions or infrastructure.
Now, let's take a look at our high-cardinality features. Perhaps each 'store_primary_category' has an associated range of 'store_id' values? If so, we might not need 'store_id' at all, since 'store_primary_category' would already capture much of the information about the store associated with each delivery.
store_info = train_df[['store_id', 'store_primary_category']]

store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])
Evidently not: we see that the 'store_id' ranges overlap across all levels of 'store_primary_category'.
A quick look at the distinct values and frequencies associated with 'store_id' & 'store_primary_category' shows that these features are high-cardinality and sparsely distributed. In general, high-cardinality categorical features can be problematic in regression tasks, especially for regression algorithms that require strictly numerical data. When such high-cardinality features are encoded, they can blow up the feature space, making the available data sparse and reducing the predictive power of a model trained on that feature space. For a better and more thorough explanation of the phenomenon, you can read more about it here.
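To make that blow-up concrete, we can count how many one-hot columns each of these features would generate if encoded naively (a quick sketch using the training data and feature names defined above):

# how many columns a naive one-hot encoding would create for each feature
for col in ['store_id', 'store_primary_category']:
    print(f"{col}: {train_df[col].nunique()} distinct values -> that many one-hot columns")

# compare against the number of training rows available
print(f"training rows: {len(train_df)}")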
Let's get a sense of how these feature values are distributed.
store_id_values = train_df['store_id'].value_counts()

# Plot the histogram
plt.figure(figsize=(8, 5))
plt.bar(store_id_values.index, store_id_values.values, color='skyblue')
# Add titles and labels
plt.title('Value Counts: store_id', fontsize=14)
plt.xlabel('store_id', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()
We see that there are a handful of stores with hundreds of orders, but most stores have far fewer than 100.
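We can quantify that long tail directly from the value counts computed above (a small sketch; exact numbers depend on the split):

# distribution of per-store order counts
print(store_id_values.describe())
print(f"Share of stores with fewer than 100 orders: {(store_id_values < 100).mean():.1%}")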
To manage the high cardinality of 'store_id', we'll create a new feature, 'store_id_freq', that groups 'store_id' values by how frequently they appear in our training data.
- We'll bin 'store_id' values into the five percentile bins shown below.
- 'store_id_freq' thus captures how popular the store fulfilling a delivery is, relative to the other stores in our training data.
- For more inspiration behind this feature, check out this thread.
import numpy as np

def encode_frequency(freq, percentiles) -> str:
    if freq < percentiles[0]:
        return '[0-50)'
    elif freq < percentiles[1]:
        return '[50-75)'
    elif freq < percentiles[2]:
        return '[75-90)'
    elif freq < percentiles[3]:
        return '[90-99)'
    else:
        return '99+'

value_counts = train_df['store_id'].value_counts()
percentiles = np.percentile(value_counts, [50, 75, 90, 99])

# apply encode_frequency to each store_id based on their number of orders
train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))
pd.DataFrame({'Count':train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')
The output of our encoding shows how the deliveries break down by store popularity: a large share were placed at stores ranked in the 90-99th percentile of popularity, while roughly 12,000 deliveries were placed at stores in the 0-50th percentile bin.
Now that we've managed (or at least tried) to capture the relevant 'store_id' information in a lower-cardinality feature, let's try to do the same with 'store_primary_category'.
Let's take a look at the most popular 'store_primary_category' levels.
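A minimal sketch of that quick look, showing for instance the 15 most common categories:

# most frequent store_primary_category levels in the training data
print(train_df['store_primary_category'].value_counts().head(15))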
A quick look shows that many of these 'store_primary_category' levels are not mutually exclusive (e.g. 'american' & 'burger'). Further investigation turns up many more examples of this kind of overlap.
So, let's try to collapse these distinct categories into a handful of basic, all-encompassing groups.
store_category_map = {
'american': ['american', 'burger', 'sandwich', 'barbeque'],
'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean',
'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],
'mexican': ['mexican'],
'italian': ['italian', 'pizza'],
}

def map_to_category_type(category: str) -> str:
    for category_type, categories in store_category_map.items():
        if category in categories:
            return category_type
    return "other"
train_df['store_category_type'] = train_df['store_primary_category'].apply(lambda x: map_to_category_type(x))
value_counts = train_df['store_category_type'].value_counts()
# Plot pie chart
plt.figure(figsize=(6, 6))
value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)
plt.title('Category Distribution')
plt.ylabel('') # Hide y-axis label for aesthetics
plt.show()
The grouping is admittedly crude, and there may well be a better way to cluster these categories. Let's proceed with it for now for simplicity's sake.
We've done a good amount of exploration of our categorical features. Now let's take a look at the distributions of our numeric features.
# Create grid for boxplots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15)) # Adjust figure size
axes = axes.flatten() # Flatten the 5x2 axes into a 1D array for easier iteration

# Generate boxplots for each numeric feature
for i, column in enumerate(numeric_feats):
    sns.boxplot(y=train_df[column], ax=axes[i])
    axes[i].set_title(f"Boxplot for {column}")
    axes[i].set_ylabel(column)
# Remove any unused subplots (if any)
for i in range(len(numeric_feats), len(axes)):
    fig.delaxes(axes[i])
# Adjust layout for better spacing
plt.tight_layout()
plt.show()
Many of the distributions appear to be right-skewed, driven by outliers on the high end.
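One way to confirm that impression numerically is to compute the skewness of each numeric feature; positive values indicate a right-skewed distribution (a quick sketch using the numeric_feats list defined earlier):

# sample skewness of each numeric feature; large positive values indicate right skew
print(train_df[numeric_feats].skew().sort_values(ascending=False))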
In particular, there seems to be an order with 400+ total items. This seems unusual, as the next largest order has fewer than 100 items.
Let's take a closer look at that 400+ item order.
train_df[train_df['total_items']==train_df['total_items'].max()]