
Pandas: Advanced Groupby Techniques for Aggregations


Getting started

groupby().sum() and groupby().mean() are great for quick checks, but production-level pipelines need more robust solutions. Real-world tables often involve multiple grouping keys, time-series data, categorical columns, and business metrics such as promotions, returns, or weighted prices.

This means you often need to aggregate values conditionally, standardize values within each group, roll data up into calendar buckets, and broadcast group statistics back onto the original rows. This article walks you through advanced groupby techniques in Pandas for handling these situations.
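
The examples below assume a transactions-style DataFrame. The column names (store, cat, rev, order_id, price, qty, ts, customer_id, is_promo, status, region) come from the snippets in this article; the sample rows themselves are just an illustrative sketch so the code has something to run against.

import pandas as pd

# Tiny illustrative frame with the columns used throughout the article
df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'cat': ['Grocery', 'Electronics', 'Grocery', 'Home'],
    'region': ['North', 'North', None, 'South'],
    'order_id': [1, 2, 3, 4],
    'customer_id': [10, 10, 11, 12],
    'ts': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-01-07', '2024-02-01']),
    'rev': [120.0, 450.0, 80.0, 200.0],
    'price': [12.0, 450.0, 8.0, 50.0],
    'qty': [10, 1, 10, 4],
    'is_promo': [True, False, False, True],
    'status': ['completed', 'returned', 'completed', 'completed'],
})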

Choosing the right mode

// Using agg to reduce each group to one row

Use agg when you want one record per group, such as sums, counts, means, medians, min/max values, and custom reductions.

out = (
    df.groupby(['store', 'cat'], as_index=False, sort=False)
      .agg(sales=('rev', 'sum'),
           orders=('order_id', 'nunique'),
           avg_price=('price', 'mean'))
)

This is great for key performance indicator (KPI) tables, weekly rollups, and multi-metric summaries.

// Using transform to broadcast statistics back to the rows

The transform method returns a result with the same shape as the input. It's great for creating per-row features such as z-scores, group shares, or group-level fills.

g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')
df['rev_share'] = df['rev'] / g.transform('sum')

This is good for modeling features, quality-assurance checks, and normalization.

// Using apply for custom per-group logic

Use apply only when the logic cannot be expressed with built-in functions. It's slower and harder to get right, so try agg or transform first.

def capped_mean(s):
    # clip to the interquartile range, then take the mean
    q1, q3 = s.quantile([.25, .75])
    return s.clip(q1, q3).mean()

df.groupby('store')['rev'].apply(capped_mean)

This is best for bespoke calculations and small groups.

// Using filter to keep or drop whole groups

The filter method keeps or drops entire groups based on a condition. This is useful for data-quality rules and cohort construction.

big = df.groupby('store').filter(lambda g: g['order_id'].nunique() >= 100)

This is good for enforcing minimum cohort sizes and for removing sparse segments before aggregating.

Multi-key groups and named aggregations

// Grouping with multiple keys

You can control the output structure and ordering so the results can be sent directly to a business intelligence tool (see the sketch after the list below).

g = df.groupby(['store', 'cat'], as_index=False, sort=False, observed=True)
  • as_index=False returns a flat DataFrame, which is easy to join and export
  • sort=False avoids reordering groups, which saves work when output order does not matter
  • observed=True (with categorical columns) drops unused category combinations
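
As a quick sketch (not from the original snippet), aggregating the grouper configured above returns a flat, join-ready frame:

out = g.agg(sales=('rev', 'sum'), orders=('order_id', 'nunique'))
# store and cat come back as regular columns because of as_index=False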

// Using named aggregations

Named aggregations produce readable column names, similar to SQL aliases.

out = (
    df.groupby(['store', 'cat'])
      .agg(sales=('rev', 'sum'),
           orders=('order_id', 'nunique'),    # use your id column here
           avg_price=('price', 'mean'))
)

// Flattening MultiIndex columns

If you aggregate several functions per column instead of using named aggregations, you get MultiIndex columns. Flatten them once and pin down the column order; the pattern below is safe to run either way.
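
For reference, a sketch of the kind of aggregation that produces tuple column names (multi is a hypothetical variable, separate from out above). The flattening code below works whether the columns are plain names, as in out, or tuples, as in multi.

multi = df.groupby(['store', 'cat']).agg({'rev': ['sum', 'mean']})
# multi.columns is a MultiIndex: ('rev', 'sum'), ('rev', 'mean')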

out = out.reset_index()
out.columns = [
    '_'.join(c) if isinstance(c, tuple) else c
    for c in out.columns
]
# optional: ensure business-friendly column order
cols = ['store', 'cat', 'orders', 'sales', 'avg_price']
out = out[cols]

Conditional aggregations without apply

// Using boolean masks inside agg

When the mask depends on other columns, align it to each group's rows by index.

# promo sales and promo rate by (store, cat)
cond = df['is_promo']
out = df.groupby(['store', 'cat']).agg(
    promo_sales=('rev', lambda s: s[cond.loc[s.index]].sum()),
    promo_rate=('is_promo', 'mean')  # proportion of promo rows
)
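
The lambda works because the mask is realigned by index, but as an alternative sketch (not from the original) you can precompute the masked column with assign and keep the aggregation fully vectorized:

out_alt = (
    df.assign(promo_rev=df['rev'].where(df['is_promo'], 0))
      .groupby(['store', 'cat'])
      .agg(promo_sales=('promo_rev', 'sum'),
           promo_rate=('is_promo', 'mean'))
)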

// Calculating rates and proportions

A rate is just sum(mask) / size, which is equivalent to the mean of a boolean column.

df['is_return'] = df['status'].eq('returned')
rates = df.groupby('store').agg(return_rate=('is_return', 'mean'))

// Creating cohort-style windows

First, build a boolean mask from the date logic, then aggregate it.

# example: repeat purchase within 30 days of first purchase per customer cohort
first_ts = df.groupby('customer_id')['ts'].transform('min')
df['within_30'] = (df['ts'] <= first_ts + pd.Timedelta('30D')) & (df['ts'] > first_ts)

# customer cohort = month of first purchase
df['cohort'] = first_ts.dt.to_period('M').astype(str)

repeat_30_rate = (
    df.groupby('cohort')
      .agg(repeat_30_rate=('within_30', 'mean'))
      .rename_axis(None)
)

Weighted metrics per group

// Using a weighted average pattern

Compute the weighted sum and the total weight separately, and guard against zero weights.

import numpy as np

tmp = df.assign(wx=df['price'] * df['qty'])
agg = tmp.groupby(['store', 'cat']).agg(wx=('wx', 'sum'), w=('qty', 'sum'))

# weighted average price per (store, cat)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)

// Handling NaN values safely

Decide what a group should return when its total weight is zero. Two common options are:

# 1) Return NaN (transparent, safest for downstream stats)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)

# 2) Fallback to unweighted mean if all weights are zero (explicit policy)
mean_price = df.groupby(['store', 'cat'])['price'].mean()
agg['wavg_price_safe'] = np.where(
    agg['w'] > 0, agg['wx'] / agg['w'], mean_price.reindex(agg.index).to_numpy()
)

Aggregating over time

// Using pd.Grouper with a frequency

Build calendar-aligned KPIs by bucketing time-series data into a fixed frequency.

weekly = df.groupby(['store', pd.Grouper(key='ts', freq='W')], observed=True).agg(
    sales=('rev', 'sum'), orders=('order_id', 'nunique')
)

// Rolling / expanding windows per group

Always sort your data first and align to the timestamp column.

df = df.sort_values(['customer_id', 'ts'])
df['rev_30d_mean'] = (
    df.groupby('customer_id')
      .rolling('30D', on='ts')['rev'].mean()
      .reset_index(level=0, drop=True)
)
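
The same per-group pattern works for expanding (cumulative) windows; a minimal sketch with a hypothetical rev_expanding_mean column:

df['rev_expanding_mean'] = (
    df.groupby('customer_id')['rev']
      .expanding()
      .mean()
      .reset_index(level=0, drop=True)  # drop the customer_id level to realign with df
)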

// Avoiding data leakage

Maintain chronological order and ensure that windows only “see” past data. Avoid filling or interpolating time-series values with future information, and do not compute group statistics on the full dataset before splitting it into training and test sets.
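
A minimal sketch of the last point, assuming a time-based split and an illustrative cutoff date: compute the group statistic on the training rows only, then map it onto both splits.

cutoff = pd.Timestamp('2024-06-01')          # illustrative cutoff
train = df[df['ts'] < cutoff]
test = df[df['ts'] >= cutoff]

# learn the group statistic on train only, then broadcast it to both splits
store_mean = train.groupby('store')['rev'].mean()
train = train.assign(store_rev_mean=train['store'].map(store_mean))
test = test.assign(store_rev_mean=test['store'].map(store_mean))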

Ranking and top-n within groups

// Finding the top-k rows in each group

Here are two practical options for choosing the top N rows from each group.

# Sort + head
top3 = (df.sort_values(['cat', 'rev'], ascending=[True, False])
          .groupby('cat')
          .head(3))

# Per-group nlargest on one metric
top3_alt = (df.groupby('cat', group_keys=False)
              .apply(lambda g: g.nlargest(3, 'rev')))

// Using helper functions

Pandas provides several helper functions for ranking and selection.

rank – controls how ties are handled (e.g., method='dense' or 'first') and can compute percentile ranks with pct=True.

df['rev_rank_in_cat'] = df.groupby('cat')['rev'].rank(method='dense', ascending=False)

cumcount – gives the 0-based position of each row within its group.

df['pos_in_store'] = df.groupby('store').cumcount()

nth – selects the k-th row from each group without sorting or a custom apply.

second_row = df.groupby('store').nth(1)  # the second row present per store
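
These helpers also combine with the top-k patterns above; a sketch (not from the original) that keeps the top 3 rows per category through a rank filter:

top3_by_rank = df[df.groupby('cat')['rev'].rank(method='first', ascending=False) <= 3]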

Feature engineering with transform

// Computing group-wise z-scores

Scale a metric within each group so that rows can be compared across different groups.

g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')

// Imputing missing values

Fill missing values with a group-level statistic such as the median. This tends to keep the distribution closer to the original than a global fill value would.

df['price'] = df['price'].fillna(df.groupby('cat')['price'].transform('median'))

// Creating group-share features

Convert raw values into within-group shares for cleaner comparisons.

df['rev_share_in_store'] = df['rev'] / df.groupby('store')['rev'].transform('sum')

Handling categories, empty groups, and missing data

// Improving speed with categorical dtypes

If your keys come from a fixed set (e.g., store names or product categories), cast them to a categorical dtype. This makes groupby faster and reduces memory usage.

from pandas.api.types import CategoricalDtype

store_type = CategoricalDtype(categories=sorted(df['store'].dropna().unique()), ordered=False)
df['store'] = df['store'].astype(store_type)

cat_type = CategoricalDtype(categories=['Grocery', 'Electronics', 'Home', 'Clothing', 'Sports'])
df['cat'] = df['cat'].astype(cat_type)

// Dropping unused combinations

When grouping on categorical columns, setting observed=True excludes category combinations that never occur in the data, resulting in cleaner output with less noise.

out = df.groupby(['store', 'cat'], observed=True).size().reset_index(name="n")

// Dealing with NaN keys

Be explicit about how you handle missing keys. By default, pandas drops NaN groups; keep them only if they help your quality-assurance process.

# Default: NaN keys are dropped
by_default = df.groupby('region').size()

# Keep NaN as its own group when you need to audit missing keys
kept = df.groupby('region', dropna=False).size()

Quick cheat sheet

// Calculating a conditional rate per group

# mean of a boolean is a rate
df.groupby(keys).agg(rate=('flag', 'mean'))
# or explicitly: sum(mask)/size
df.groupby(keys).agg(rate=('flag', lambda s: s.sum() / s.size))

// Computing a weighted mean

(df.assign(wx=df[x] * df[w])
   .groupby(keys)
   .apply(lambda g: g['wx'].sum() / g[w].sum() if g[w].sum() else np.nan)
   .rename('wavg'))

// Finding the top-k rows per group

(df.sort_values([key, metric], ascending=[True, False])
   .groupby(key)
   .head(k))
# or
df.groupby(key, group_keys=False).apply(lambda g: g.nlargest(k, metric))

// Calculating weekly metrics

df.groupby([key, pd.Grouper(key='ts', freq='W')], observed=True).agg(...)

// Filling missing values per group

df[col] = df[col].fillna(df.groupby(keys)[col].transform('median'))

// Calculating the share within the group

df['share'] = df[val] / df.groupby(keys)[val].transform('sum')

Wrapping up

First, choose the right mode for the job: use agg to reduce, transform to broadcast, and reserve apply for cases where a vectorized option is not available. Lean on pd.Grouper for time-based buckets and on the ranking helpers for top-n selections. By favoring clear, reproducible patterns, you keep your results flat, well named, and easy to test, so your metrics stay correct and your notebooks run fast.

Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.

