Pandas: Advanced Groupby Techniques for Aggregations


Getting started
While groupby().sum() and groupby().mean() are great for quick checks, production-level pipelines need more robust solutions. Real-world tables often involve multiple grouping keys, time-series data, and conditional metrics such as promotions, returns, or weighted prices.
This means you often need to aggregate values conditionally, standardize values within each group, roll data up into calendar buckets, and broadcast group statistics back onto the original rows. This article walks you through advanced groupby techniques in the Pandas library for handling these situations.
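To make the examples below concrete, here is a minimal, hypothetical transactions table with the column names used throughout (store, cat, rev, price, qty, order_id, ts, is_promo, status, customer_id, region); the values are purely illustrative, and any frame with similar columns will work.
import numpy as np
import pandas as pd

# illustrative sample data only; not part of the original article
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    'store': rng.choice(['S1', 'S2', 'S3'], n),
    'cat': rng.choice(['Grocery', 'Electronics', 'Home'], n),
    'order_id': rng.integers(1, 400, n),
    'customer_id': rng.integers(1, 200, n),
    'price': rng.uniform(1, 100, n).round(2),
    'qty': rng.integers(1, 5, n),
    'is_promo': rng.random(n) < 0.3,
    'status': rng.choice(['ok', 'returned'], n, p=[0.9, 0.1]),
    'region': rng.choice(['North', 'South', None], n),
    'ts': pd.Timestamp('2024-01-01') + pd.to_timedelta(rng.integers(0, 180, n), unit='D'),
})
df['rev'] = df['price'] * df['qty']  # revenue per line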
Choosing the right mode
// Use agg to reduce groups to one row
Use agg when you want one row per group, such as sums, counts, medians, min/max values, and custom reductions.
out = (
df.groupby(['store', 'cat'], as_index=False, sort=False)
.agg(sales=('rev', 'sum'),
orders=('order_id', 'nunique'),
avg_price=('price', 'mean'))
)
This is great for key performance indicator (KPI) tables, weekly rollups, and multi-metric summaries.
// Use transform to broadcast statistics back to the rows
The transform method returns a result with the same shape as its input. It's great for building per-row features such as z-scores, group shares, or group-based imputations.
g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')
df['rev_share'] = df['rev'] / g.transform('sum')
This is good for modeling features, quality assurance checks, and imputation.
// Use apply for custom per-group logic
Use apply only when the logic cannot be expressed with built-in functions. It is slower and easier to get wrong, so try agg or transform first.
def capped_mean(s):
q1, q3 = s.quantile([.25, .75])
return s.clip(q1, q3).mean()
df.groupby('store')['rev'].apply(capped_mean)
This is best reserved for bespoke calculations on small groups.
// Use filter to keep or drop whole groups
The filter method lets entire groups pass or fail a condition. This is useful for data quality rules and cohort construction.
big = df.groupby('store').filter(lambda g: g['order_id'].nunique() >= 100)
This is good for enforcing minimum cohort sizes and for removing sparse segments before aggregation.
Multiple keys and named aggregations
// Grouping with multiple keys
You can control the output structure and ordering so that results can be fed directly into a business intelligence tool.
g = df.groupby(['store', 'cat'], as_index=False, sort=False, observed=True)
as_index=False returns a flat DataFrame, which is easy to join and export.
sort=False avoids reordering groups, saving work when ordering doesn't matter.
observed=True (for categorical key columns) drops unused category combinations.
// Using named aggregations
Named aggregations produce readable column names, similar to SQL aliases.
out = (
df.groupby(['store', 'cat'])
.agg(sales=('rev', 'sum'),
orders=('order_id', 'nunique'), # use your id column here
avg_price=('price', 'mean'))
)
// Flattening MultiIndex columns
If you aggregate multiple columns with multiple functions, you get MultiIndex columns. Flatten them once and fix the column order.
out = out.reset_index()
out.columns = [
'_'.join(c) if isinstance(c, tuple) else c
for c in out.columns
]
# optional: ensure business-friendly column order
cols = ['store', 'cat', 'orders', 'sales', 'avg_price']
out = out[cols]
Conditional aggregations without apply
// Using boolean-mask calculations inside agg
When the mask depends on another column, align it to each group's index.
# promo sales and promo rate by (store, cat)
cond = df['is_promo']
out = df.groupby(['store', 'cat']).agg(
promo_sales=('rev', lambda s: s[cond.loc[s.index]].sum()),
promo_rate=('is_promo', 'mean') # proportion of promo rows
)
// Calculating rates and proportions
A rate is just sum(mask) / size, which is equivalent to taking the mean of a boolean column.
df['is_return'] = df['status'].eq('returned')
rates = df.groupby('store').agg(return_rate=('is_return', 'mean'))
// Creating cohort-style windows
First, build a boolean mask from the date logic, then aggregate it.
# example: repeat purchase within 30 days of first purchase per customer cohort
first_ts = df.groupby('customer_id')['ts'].transform('min')
df['within_30'] = (df['ts'] <= first_ts + pd.Timedelta('30D')) & (df['ts'] > first_ts)
# customer cohort = month of first purchase
df['cohort'] = first_ts.dt.to_period('M').astype(str)
repeat_30_rate = (
df.groupby('cohort')
.agg(repeat_30_rate=('within_30', 'mean'))
.rename_axis(None)
)
Weighted metrics for each group
// Using a weighted average pattern
Compute the weighted numerator and denominator explicitly, and guard against zero total weights.
import numpy as np
tmp = df.assign(wx=df['price'] * df['qty'])
agg = tmp.groupby(['store', 'cat']).agg(wx=('wx', 'sum'), w=('qty', 'sum'))
# weighted average price per (store, cat)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)
// Handling NaN values safely
Decide what to return for groups with zero total weight or all-NaN values. Two common options are:
# 1) Return NaN (transparent, safest for downstream stats)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)
# 2) Fallback to unweighted mean if all weights are zero (explicit policy)
mean_price = df.groupby(['store', 'cat'])['price'].mean()
agg['wavg_price_safe'] = np.where(
agg['w'] > 0, agg['wx'] / agg['w'], mean_price.reindex(agg.index).to_numpy()
)
Aggregating over time
// Using pd.Grouper for calendar frequencies
Build calendar-based KPIs by bucketing time-series data into fixed frequencies.
weekly = df.groupby(['store', pd.Grouper(key='ts', freq='W')], observed=True).agg(
sales=('rev', 'sum'), orders=('order_id', 'nunique')
)
// Using rolling / expanding windows per group
Always sort your data first and align to the timestamp column.
df = df.sort_values(['customer_id', 'ts'])
df['rev_30d_mean'] = (
df.groupby('customer_id')
.rolling('30D', on='ts')['rev'].mean()
.reset_index(level=0, drop=True)
)
// Avoiding data leakage
Maintain chronological order and ensure that windows only “see” past data. Avoid imputing time series with values that come from the future, and do not compute group statistics on the full dataset before splitting it into training and test sets.
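As a minimal sketch (using the same hypothetical df and columns as above), one leakage-safe pattern is to build each row's group statistic only from earlier rows, for example an expanding mean shifted by one so the current row never sees itself or the future:
# expanding mean of past revenue per customer; shift(1) excludes the current row
df = df.sort_values(['customer_id', 'ts'])
df['rev_past_mean'] = (
    df.groupby('customer_id')['rev']
      .transform(lambda s: s.expanding().mean().shift(1))
)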
Ranking and top-n within groups
// Finding the top-k rows in each group
Here are two practical options for choosing the top N rows from each group.
# Sort + head
top3 = (df.sort_values(['cat', 'rev'], ascending=[True, False])
.groupby('cat')
.head(3))
# Per-group nlargest on one metric
top3_alt = (df.groupby('cat', group_keys=False)
.apply(lambda g: g.nlargest(3, 'rev')))
// Using helper functions
Pandas provides several helper functions for ranking and selection.
rank – Controls how ties are handled (e.g. method='dense' or 'first') and can compute percentile ranks with pct=True.
df['rev_rank_in_cat'] = df.groupby('cat')['rev'].rank(method='dense', ascending=False)
cumcount – Gives the 0-based position of each row within its group.
df['pos_in_store'] = df.groupby('store').cumcount()
nth – Selects the k-th row of each group by position, without sorting the data.
second_row = df.groupby('store').nth(1) # the second row present per store
Row-level features with transform
// Performing group-wise standardization
Standardize a metric within each group so that rows from different groups can be compared.
g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')
// Imputing missing values
Fill missing values with a group-level statistic. This tends to keep the distribution closer to the original than a global fill value would.
df['price'] = df['price'].fillna(df.groupby('cat')['price'].transform('median'))
// Creating group-share features
Convert raw values into within-group shares for cleaner comparisons.
df['rev_share_in_store'] = df['rev'] / df.groupby('store')['rev'].transform('sum')
Handling categories, empty groups, and missing keys
// Improving speed with categorical dtypes
If your keys come from a fixed set (e.g. store or category names), cast them to categorical dtypes. This makes groupby faster and uses less memory.
from pandas.api.types import CategoricalDtype
store_type = CategoricalDtype(categories=sorted(df['store'].dropna().unique()), ordered=False)
df['store'] = df['store'].astype(store_type)
cat_type = CategoricalDtype(categories=['Grocery', 'Electronics', 'Home', 'Clothing', 'Sports'])
df['cat'] = df['cat'].astype(cat_type)
// Dropping unused combinations
When grouping by categorical columns, setting observed=True excludes category combinations that do not actually occur in the data, resulting in cleaner output with less noise.
out = df.groupby(['store', 'cat'], observed=True).size().reset_index(name="n")
// Dealing with NaN keys
Be explicit about how you handle missing keys. By default, pandas drops NaN keys; keep them only if doing so helps your quality assurance process.
# Default: NaN keys are dropped
by_default = df.groupby('region').size()
# Keep NaN as its own group when you need to audit missing keys
kept = df.groupby('region', dropna=False).size()
Quick cheat sheet
// Calculating a conditional rate for each group
# mean of a boolean is a rate
df.groupby(keys).agg(rate=('flag', 'mean'))
# or explicitly: sum(mask)/size
df.groupby(keys).agg(rate=('flag', lambda s: s.sum() / s.size))
// Computing a weighted average
(df.assign(wx=df[x] * df[w])
   .groupby(keys)
   .apply(lambda g: g['wx'].sum() / g[w].sum() if g[w].sum() else np.nan)
   .rename('wavg'))
// Finding TOP-K for each group
(df.sort_values([key, metric], ascending=[True, False])
.groupby(key)
.head(k))
# or
df.groupby(key, group_keys=False).apply(lambda g: g.nlargest(k, metric))
// Calculating weekly metrics
df.groupby([key, pd.Grouper(key='ts', freq='W')], observed=True).agg(...)
// Filling missing values per group
df[col] = df[col].fillna(df.groupby(keys)[col].transform('median'))
// Calculating the share within the group
df['share'] = df[val] / df.groupby(keys)[val].transform('sum')
Wrapping up
Choose the right mode for the job: use agg for reductions, transform for broadcasting, and reserve apply for cases where built-ins are not an option. Lean on pd.Grouper for time-based buckets and on the ranking helpers for top-n selections. By favoring clear, reproducible patterns, you keep your results flat, well-named, and easy to test, so your metrics stay correct and your notebooks run fast.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.



