Public EDA (Part 2): Product Deep Dive & Time-Series Analysis in Pandas

Welcome back to the “EDA in Public” series! This is Part 2; if you haven't seen Part 1, read it here. Here's a quick recap of what we accomplished.
In Part 1, we took a dirty, million-line file and whipped it into an analysis-ready format. Specifically, we:
- Sampled the data for quick processing.
- Fixed missing values and standardized text entries (such as 'EIRE' to 'Ireland').
- Filtered out the noise: returns, cancellations, and zero-price purchases.
- Built our most important feature: the Revenue column.
- Gained our first business insight: the top 10 revenue-generating countries.
We now have a clean, revenue-ready Pandas DataFrame, primed for insights! By the end of this series, you'll be well on your way to mastering exploratory data analysis with Pandas, so feel free to follow along if you're a data lover.
We now turn our attention to extracting truly impactful business intelligence for NovaShop. Our overall goal for Part 2 is to answer two key questions: what are the best products, and when is the best time to sell them?
This part is all about aggregation using Pandas' powerful grouping tool (.groupby()) and feature engineering on the datetime column (the .dt accessor). We will divide our analysis into two main areas:
- Product Deep Dive: identifying products that move the most units versus those that bring in the most revenue.
- Sales Timing (Time Series): revealing seasonality (monthly peaks) and operational patterns (busiest day and hour).
Without further ado, let's get to it.
Loading the Clean Data
To avoid re-running the cleaning steps every time, let's export our cleaned dataset and import it again for analysis:
df.to_csv('online_retail_clean.csv', index=False)
Now I can create a new notebook and pick up right where we left off.
When I load CSVs that contain date columns, I like to use the parse_dates argument of pd.read_csv(). This saves me the step of manually converting the column type later, ensuring the datetime properties are correct from the start.
import pandas as pd
df = pd.read_csv('online_retail_clean.csv', parse_dates=['InvoiceDate'])
print(f"Successfully loaded clean data. Total rows: {len(df)}")
Output:
Successfully loaded clean data. Total rows: 52933
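If you want to see what parse_dates actually buys you without the full file, here's a minimal, self-contained sketch. The two sample rows are made up for illustration; only the column names match our dataset:

```python
import io
import pandas as pd

# A tiny stand-in for online_retail_clean.csv (hypothetical sample rows)
csv_data = io.StringIO(
    "InvoiceDate,Description,Quantity,Revenue\n"
    "2011-06-01 12:05:00,JUMBO BAG RED RETROSPOT,10,19.50\n"
    "2011-11-16 10:39:00,PARTY BUNTING,4,19.80\n"
)

df = pd.read_csv(csv_data, parse_dates=['InvoiceDate'])

# parse_dates gives us a true datetime column, so .dt works immediately
print(df['InvoiceDate'].dtype)             # datetime64[ns]
print(df['InvoiceDate'].dt.hour.tolist())  # [12, 10]
```

Without parse_dates, the column would load as plain strings (dtype object) and every .dt call later would fail.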
Product Deep Dive: Who Are the Real Money Makers?
With the clean data loaded, let's jump straight into product performance. Products can be considered “best-selling” in two different ways: by volume (units sold) or by value (revenue).
Insight 2: Top 10 Bestsellers (by Volume)
The quantity list tells us which products are popular and in demand, even if their prices are low. This is crucial for stocking and inventory management.
We will group by product Description, sum the Quantity, and sort in descending order.
# Top 10 Bestsellers (by Quantity)
df.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)
Output:
Description
MEDIUM CERAMIC TOP STORAGE JAR 74698
WORLD WAR 2 GLIDERS ASSTD DESIGNS 9419
GIN + TONIC DIET METAL SIGN 5016
ASSORTED COLOURS SILK FAN 4684
JUMBO BAG RED RETROSPOT 4671
PACK OF 72 RETROSPOT CAKE CASES 3561
MINI PAINT SET VINTAGE 3285
ASSORTED COLOUR BIRD ORNAMENT 2783
BROCADE RING PURSE 2525
POPCORN HOLDER 2508
Name: Quantity, dtype: int64
Insight 3: Top 10 Income Drivers (Real Money Makers)
This is what NovaShop really wants to know. These products drive profit. They may not sell as many units as the items on the previous list, but their higher prices or bulk orders make them more valuable.
We follow the same process, but this time we sum the Revenue column we created in Part 1.
# Top 10 Revenue Drivers (The True Money-Makers)
df.groupby('Description')['Revenue'].sum().sort_values(ascending=False).head(10)
Output:
Description
MEDIUM CERAMIC TOP STORAGE JAR 77785.74
DOTCOM POSTAGE 18456.13
REGENCY CAKESTAND 3 TIER 15093.38
AMAZON FEE 13541.33
PARTY BUNTING 9491.60
GIN + TONIC DIET METAL SIGN 9480.51
JUMBO BAG RED RETROSPOT 8996.65
Manual 8966.96
CHILLI LIGHTS 8395.20
POSTAGE 7194.51
Name: Revenue, dtype: float64
Discussion: What These Lists Tell NovaShop
This is where the power of EDA really shines, because these two lists tell a very different story:
- The All-Star Product: The “MEDIUM CERAMIC TOP STORAGE JAR” is clearly the winner, dominating both lists by a large margin: high volume and high value. This item should be NovaShop's main focus for inventory, marketing, and expansion.
- Surprising Revenue Sources: While examining the data, I noticed entries like “DOTCOM POSTAGE,” “AMAZON FEE,” “Manual,” and “POSTAGE.” These are not actual products; they are service fees and manual adjustments. Yet together they bring in almost £48,000 in revenue. NovaShop should double-check that these fees are accounted for in their net profit. They're easy to miss, but they can be a quiet, meaningful revenue stream hiding in plain sight.
- High-Value Physical Products: Items like “REGENCY CAKESTAND 3 TIER” and “CHILLI LIGHTS” appear on the revenue list but not the quantity list. This tells us they are higher-priced items that punch above their weight in total revenue, and are worth targeted marketing efforts.
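If NovaShop wanted product-only figures, one approach is to filter those service entries out before aggregating. A minimal sketch, assuming a hypothetical mini-DataFrame (the fee labels come from the revenue output above; the rows are made up):

```python
import pandas as pd

# Hypothetical mini-DataFrame mimicking the cleaned data
df = pd.DataFrame({
    'Description': ['DOTCOM POSTAGE', 'AMAZON FEE', 'Manual',
                    'REGENCY CAKESTAND 3 TIER', 'CHILLI LIGHTS'],
    'Revenue': [18456.13, 13541.33, 8966.96, 15093.38, 8395.20],
})

# Non-product entries spotted in the revenue list
service_fees = ['DOTCOM POSTAGE', 'AMAZON FEE', 'Manual', 'POSTAGE']

# Split fee revenue from genuine product revenue
products_only = df[~df['Description'].isin(service_fees)]
fee_revenue = df[df['Description'].isin(service_fees)]['Revenue'].sum()

print(f"Service-fee revenue: £{fee_revenue:,.2f}")
print(products_only.groupby('Description')['Revenue'].sum().sort_values(ascending=False))
```

The same mask applied to the full dataset would give a product-only leaderboard alongside a separate fees line item.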
Now we have a complete picture of what NovaShop sells best. Time to find out when!
Preparing for Time Series Analysis: Feature Engineering
We know what has sold well. Now let's find out when sales peak. Our goal is to break the InvoiceDate column into components that let us group and analyze sales by year, month, day, and even hour.
This is a classic feature-engineering move. Since we already made sure InvoiceDate is a proper pandas datetime object (thanks to parse_dates in our loading step!), we can use the wonderfully handy .dt accessor.
Extracting Time Components
To understand seasonality (monthly) and operations (hourly), we will create four new columns:
- Year – for long-term comparisons.
- Month – for identifying seasonal peaks (e.g., the Q4 rush).
- DayName – for finding the busiest day of the week.
- Hour – for pinpointing peak shopping hours.
Let's generate these columns:
print("\n--- Time Series Feature Extraction ---")
# Extract Year, Month, DayName, and Hour from the InvoiceDate column
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['DayName'] = df['InvoiceDate'].dt.day_name()
df['Hour'] = df['InvoiceDate'].dt.hour
print("New time features added: Year, Month, DayName, Hour.")
print(df[['InvoiceDate', 'Year', 'Month', 'DayName', 'Hour']].head())
Output:
--- Time Series Feature Extraction ---
New time features added: Year, Month, DayName, Hour.
InvoiceDate Year Month DayName Hour
0 2011-06-01 12:05:00 2011 6 Wednesday 12
1 2011-05-27 17:14:00 2011 5 Friday 17
2 2011-04-21 17:05:00 2011 4 Thursday 17
3 2011-11-16 10:39:00 2011 11 Wednesday 10
4 2011-03-10 08:40:00 2011 3 Thursday 8
Perfect! We can get right into mapping sales patterns in the next section.
Mapping Sales Patterns (Time Series Data)
With our time elements ready, we can now ask and answer our questions about when sales occur. We'll start big (monthly trends) and work our way down (hourly trends).
Insight 4: Monthly Sales Trends (Seasons)
Monthly aggregation is key to revealing seasonality: a predictable sales pattern that repeats every year. This data drives financial forecasting and peak-season inventory planning.
We will group by the Month column and calculate the total revenue for each.
print("\n--- Insight 4: Sales Trends by Month (Seasonality) ---")
# Group by month and sum the revenue
monthly_revenue = df.groupby('Month')['Revenue'].sum().sort_values(ascending=False)
# Print the results, which should show the seasonal peaks
print(monthly_revenue)
Output:
--- Insight 4: Sales Trends by Month (Seasonality) ---
Month
11 143576.360
1 142013.420
12 133178.980
10 119533.540
9 103929.420
3 72968.270
8 71302.890
5 70424.510
6 68722.520
7 68713.831
4 51882.010
2 50178.730
Name: Revenue, dtype: float64
Interpretation: A Massive Q4 and a January Surprise
The data clearly shows that NovaShop's sales are heavily skewed toward the end of the year, confirming the classic retail pattern: Q4 (October, November, December) dominates.
- Peak Season: November is the top month by a narrow margin, followed closely by December. This is the time to stock up on those MEDIUM CERAMIC TOP STORAGE JARs and run targeted ads.
- The January Surge: The second-highest month is January! This is arguably the most valuable insight for NovaShop. It suggests customers may be spending Christmas gift money or taking advantage of post-holiday sales. This period should be treated not as a slump but as a second high-volume selling window.
- Low Season: February and April are the lowest performers, which helps NovaShop plan for reduced staffing and marketing budgets during those periods.
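One practical note: sort_values is great for ranking, but for plotting the seasonal curve you'll want the months back in calendar order. A quick sketch, using made-up monthly totals keyed by month number:

```python
import pandas as pd

# Hypothetical monthly revenue totals, ranked by value (as groupby + sort_values returns)
monthly_revenue = pd.Series(
    {11: 143576.36, 1: 142013.42, 12: 133178.98, 2: 50178.73},
    name='Revenue',
)

# sort_index() restores calendar order, ready for a time-series line chart
calendar_order = monthly_revenue.sort_index()
print(calendar_order)
# calendar_order.plot(kind='line', title='Monthly Revenue')  # uncomment to visualize
```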
Insight 5: Sales Trends by Day of Week and Hour (Operational Planning)
While seasonality drives finances and inventory, daily and hourly trends drive operations. NovaShop can use these to schedule warehouse staff, optimize ad spend, and time email campaigns.
We will use two different aggregations here: one for the day of the week and one for the hour of the day.
Revenue by Day of the Week
# 1. Sales by Day of Week
daily_revenue = df.groupby('DayName')['Revenue'].sum()
print("\n--- Revenue by Day of Week ---")
print(daily_revenue)
Output:
--- Revenue by Day of Week ---
DayName
Friday 163861.320
Monday 171026.230
Sunday 83125.890
Thursday 219342.980
Tuesday 282796.741
Wednesday 176271.320
Name: Revenue, dtype: float64
Interpretation: Tuesday Is the Power Day
The data reveals a clear and predictable pattern of the work week:
- Peak Sales: Tuesday completely dominates the week, earning nearly 30% more than the next-busiest day, Thursday. This is the ideal day to launch new products or run high-impact flash sales.
- Slump Day: Unsurprisingly, sales plummet on Sunday. NovaShop might deprioritize this day for email marketing or customer-service staffing.
- The Day Off: The dataset contains no Saturday transactions in our sample (and possibly in the full set; this is common for B2B-heavy datasets).
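You may have noticed the day-of-week output above is alphabetical rather than chronological. One common fix is to reindex with an explicit weekday order; as a bonus, the missing Saturday shows up as an explicit zero. A sketch, with the revenue figures copied from the output above:

```python
import pandas as pd

# Day-of-week totals as groupby returns them (alphabetical index)
daily_revenue = pd.Series({
    'Friday': 163861.32, 'Monday': 171026.23, 'Sunday': 83125.89,
    'Thursday': 219342.98, 'Tuesday': 282796.74, 'Wednesday': 176271.32,
})

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# reindex() reorders the Series; fill_value=0 fills the missing Saturday
ordered = daily_revenue.reindex(day_order, fill_value=0)
print(ordered)
```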
Revenue by Hour of the Day
# 2. Sales by Hour of Day
hourly_revenue = df.groupby('Hour')['Revenue'].sum().sort_index()
print("\n--- Revenue by Hour of Day ---")
print(hourly_revenue)
Output:
--- Revenue by Hour of Day ---
Hour
7 2830.910
8 26597.800
9 76768.750
10 209809.971
11 115233.600
12 142474.460
13 130348.290
14 119241.520
15 145178.980
16 70620.550
17 36148.030
18 14022.790
19 4397.130
20 2751.700
Interpretation: The Late-Morning Boom
Hourly data points directly to operational requirements:
- The Massive Peak: Sales explode at 10 AM, generating far more revenue than any other hour. This is the perfect time for NovaShop to land email marketing and social media campaigns.
- The Midday Plateau: Sales remain strong from 11 AM to 3 PM, suggesting a steady buying window through midday and after lunch.
- End of Day: Activity drops sharply after 4 PM, with very little revenue earned in the evening (7 PM onwards). This suggests NovaShop's customer base is primarily desk-based, shopping during traditional business hours.
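To combine the daily and hourly views, a day-by-hour pivot table works nicely; fed into a heatmap, it shows exactly when to staff up. A minimal sketch using the features we engineered earlier (the toy transactions are made up):

```python
import pandas as pd

# Hypothetical transactions with the DayName and Hour features
df = pd.DataFrame({
    'DayName': ['Tuesday', 'Tuesday', 'Monday', 'Monday'],
    'Hour':    [10, 10, 15, 10],
    'Revenue': [120.0, 80.0, 50.0, 30.0],
})

# Rows = day, columns = hour, values = total revenue; missing cells become 0
heat = df.pivot_table(index='DayName', columns='Hour',
                      values='Revenue', aggfunc='sum', fill_value=0)
print(heat)
# With seaborn installed, sns.heatmap(heat) turns this into a staffing chart
```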
Conclusion: Setting the Stage for Customer Segmentation
We have officially completed the heavy EDA! We know what sells best (ceramic jars and service fees), where sales are strongest (the UK, from Part 1), and when sales peak (Q4, Tuesdays, and 10 AM).
However, we are still missing the most important part of the puzzle: WHO. The best companies don't just know what sells; they know who their most valuable customers are and how to keep them.
In Part 3, we'll dive into the most advanced analytical method in this series: RFM (Recency, Frequency, Monetary) Customer Segmentation. This will allow us to segment customers into categories such as “Champions,” “Loyal Customers,” and “At Risk,” giving NovaShop a truly personalized marketing roadmap.
I hope you found this article helpful. Feel free to say hi on any of these platforms; I'd really appreciate your feedback.
Medium
YouTube



