SQL Window Functions Beyond the Basics: Solving Real Business Problems

# Introduction
Most of you use SQL window functions, but you're only scratching the surface — a ROW_NUMBER() here, a SUM() OVER() There. The power of window functions is revealed when you apply them to difficult problems. I'll walk you through four patterns that show window functions when they're most useful.
The examples are all real interview questions that you can practice on StrataScratch.
# Running Rates
Calculating effective rates is one of the most commonly used business window functions. Finance people absolutely love it! It is used to track monthly cumulative income, which then easily feeds into calculating where you are against your annual income goal.
What makes this a problem for windowing is that, in general, you have to enter both the period value and the cumulative total in the same output. You can not use GROUP BY with SUM()because that wraps individual lines. So, the obvious solution is to use the window function, that is, SUM() OVER().
// Example: Calculating Income Over Time
This Amazon question initially asks you to calculate a 3-month average. However, we will not ignore that and calculate the cumulative income for each month.
Data: Here it is amazon_purchases table preview.
| User ID | created_to | purchase_amt |
|---|---|---|
| 10 | 2020-01-01 | 3742 |
| 11 | 2020-01-04 | 1290 |
| 12 | 2020-01-07 | 4249 |
| … | … | … |
| 109 | 2020-10-24 | 1749 |
Code: An inner query converts dates into YYYY-MM formatting using TO_CHAR() and includes monthly income, filter benefits by WHERE purchase_amt > 0.
The outer query uses a window function over those total values for the calculated month. I am not specifying a clear (on purpose) framing clause in it OVER()so the window function is automatic RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That means the window is all the rows before the current row, that is, the month. In other words, the cumulative total is: all previous months + current month. No wonder that is something textbook definition of accumulated sum.
SELECT t.month,
t.monthly_revenue,
SUM(t.monthly_revenue) OVER(ORDER BY t.month) AS cumulative_revenue
FROM (
SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
SUM(purchase_amt) AS monthly_revenue
FROM amazon_purchases
WHERE purchase_amt > 0
GROUP BY TO_CHAR(created_at::date, 'YYYY-MM')
ORDER BY TO_CHAR(created_at::date, 'YYYY-MM')
) t
ORDER BY t.month ASC;
Output:
| month | monthly_income | net_income |
|---|---|---|
| 2020-01 | 26292 | 26292 |
| 2020-02 | 20695 | 46987 |
| 2020-03 | 29620 | 76607 |
| … | … | … |
| 2020-10 | 15310 | 239869 |
# Gaps and Islands (Session)
This pattern, too, involves sequential data, just like using values, but uses different window functions.
An An island is a series of parallel lineseg, consecutive daily logins. A the gap between the islands.
One of the most common real-world applications of this pattern sessionization – grouping raw event streams into sessions. Time is usually defined as a a sequence of events from the same user where no gap between consecutive events exceeds a certain timeout (30 minutes is standard for web analytics).
A session is usually used to product and data engineering. It is used wherever you need to group a stream of raw events into important units of work.
Classic retrieval in SQL consists of two steps:
LAG()orLEAD()— to compare each line with the one before or after it, and flag where a new line begins.SUM(flag) OVER (PARTITION BY user ORDER BY date)— to accumulate flags in the ID of the streak, as it remains flat within the streak and increases in all boundaries.
// Example: Finding User Routes
The question from LinkedIn and Meta discussions asks you to find the top three users with the longest series of visits to the forum until August 10, 2022. You must exclude all users with the longest series of three, if there is more than one user in each line length.
Data: The table user_streaks.
| User ID | date_visited |
|---|---|
| u001 | 2022-08-01 |
| u001 | 2022-08-01 |
| u004 | 2022-08-01 |
| … | … |
| u005 | 2022-08-11 |
Code: The question is long, but it's well organized into CTEs, so it's easy to follow.
unique_visits: Removes duplicate visit records and consolidates data on August 10, 2022.streak_flags: UsingLAG()to find the previous visit date for each user and flag the row as0(continuation of the streak if the gap is 1 day) or1(new start of sequence for any other gap).streak_ids: Converts flags to streak group IDs using accumulatedSUM().streak_lengths: Counts days in sequence.longest_per_user: Stores only the longest string for each user.ranked_lengths: Measures the length of a unique streak.top_lengths: Gets the maximum 3 values of the streak length.
The final SELECT it covers everything: it shows all users with three high streaks and streak length in days.
WITH unique_visits AS (
SELECT DISTINCT user_id, date_visited
FROM user_streaks
WHERE date_visited <= DATE '2022-08-10'),
streak_flags AS (
SELECT *,
CASE
WHEN date_visited
- LAG(date_visited) OVER (PARTITION BY user_id ORDER BY date_visited) = 1
THEN 0
ELSE 1
END AS new_streak
FROM unique_visits),
streak_ids AS (
SELECT *,
SUM(new_streak) OVER (PARTITION BY user_id ORDER BY date_visited) AS streak_id
FROM streak_flags),
streak_lengths AS (
SELECT user_id,
streak_id,
COUNT(*) AS streak_length
FROM streak_ids
GROUP BY user_id, streak_id),
longest_per_user AS (
SELECT user_id,
MAX(streak_length) AS streak_length
FROM streak_lengths
GROUP BY user_id),
ranked_lengths AS (
SELECT DISTINCT
streak_length,
DENSE_RANK() OVER (ORDER BY streak_length DESC) AS len_rank
FROM longest_per_user),
top_lengths AS (
SELECT streak_length
FROM ranked_lengths
WHERE len_rank <= 3)
SELECT u.user_id,
u.streak_length
FROM longest_per_user u
JOIN top_lengths t USING (streak_length)
ORDER BY u.streak_length DESC, u.user_id;
Output:
| User ID | streak_length |
|---|---|
| u004 | 10 |
| u005 | 10 |
| u003 | 5 |
| u001 | 4 |
| u006 | 4 |
# Cluster Analysis
A A group is a group of users who share an initial eventfor example, first purchase, first login, or first registration date. Analyzing cohorts is it basis for maintenance reportingas it answers the question of how many users returned after the first event.
The main thing in cluster analysis is to find the the team's anchor in the user's activity history so that you can measure all subsequent activity against it.
Doing that in SQL boils down to three main window operations:
MIN(event_time) OVER (PARTITION BY user_id)— the most common pattern where the anchor is the day.FIRST_VALUE(attribute) OVER (PARTITION BY user_id ORDER BY event_time)— used when you need the value of the pin itself, eg, the first seller or the first category of the product.ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) = 1— is used when you want to split the first event as a separate line and merge it back into the full history rather than broadcasting to all lines.
// Example: Calculating First Orders
Here's a question for DoorDash. It requires you to count the number of orders and first orders (from the customer's point of view) that each seller has had. You should also exclude sellers who have not yet received orders.
Data: The first table is named order_details.
| id | customer_identity | vendor_identity | order_timestamp | n_things | earned_value |
|---|---|---|---|---|---|
| 8 | 1049 | 6 | 2022-01-14 01:00:28 | 5 | 16.3 |
| 7 | 1049 | 5 | 2022-01-14 11:50:29 | 4 | 2.16 |
| 22 | 1049 | 1 | 2022-01-14 22:46:54 | 8 | 2.63 |
| … | … | … | … | … | … |
| 39 | 1060 | 1 | 2022-01-16 22:27:30 | 11 | 15.41 |
The second table says merchant_details.
| id | name | category | zip code |
|---|---|---|---|
| 1 | Treehouse pizza | an American | 92507 |
| 2 | Thai lion | asian | 90017 |
| 3 | Raven for dinner | fast food | 95204 |
| … | … | … | … |
| 7 | The taste of Gyros | the mediterranean | 94789 |
Code: The first CTE is where group mentality occurs. I use the FIRST_VALUE() window function to paste the seller from each customer's previous order to all lines in their order history. The result is a table where each order carries a label of which vendor the customer started with.
In the second CTE, I join the labels back to the full order history using LEFT JOIN to ensure that sellers who received orders but never became anyone's first seller still appear in the result. We use COUNT() again DISTINCT counting only the customers for whom that salesperson was the first — that is the size of your group. It's one COUNT()you get the total number of orders. DISTINCT is needed here, too, because LEFT JOIN with first_order can generate duplicate order lines – since first_order it stores one row per order (not per customer), one order at a time order_details can match multiple lines in between first_order to the same customer, increasing the price without it.
In the finals SELECTwe join the number_of_customers CTE with merchant_details to deliver vendor names.
WITH first_order AS (
SELECT customer_id,
FIRST_VALUE(merchant_id) OVER(PARTITION BY customer_id ORDER BY order_timestamp) AS first_merchant
FROM order_details),
number_of_customers AS (
SELECT merchant_id,
COUNT(DISTINCT f.customer_id) AS first_time_orders,
COUNT(DISTINCT id) AS total_number_of_orders
FROM order_details d
LEFT JOIN first_order f ON d.merchant_id = f.first_merchant
GROUP BY 1)
SELECT name,
total_number_of_orders,
first_time_orders
FROM number_of_customers
JOIN merchant_details ON number_of_customers.merchant_id = merchant_details.id;
Output:
| name | total_number_of_orders | first_time_orders |
|---|---|---|
| Treehouse pizza | 8 | 1 |
| Thai lion | 14 | 7 |
| Raven for dinner | 12 | 0 |
| Burger A1 | 4 | 0 |
| Sushi Bay | 7 | 3 |
| Tacos You | 7 | 1 |
# Percentages and Rate Analysis
Combined functions tell you the ratio. Window-based level functions tell you the distribution, and the distribution is where the business questions of interest reside. Is your 90th percentile order value unusually high, suggesting that a few large buyers are stealing revenue? Are the bottom 25% of salespeople clustered near the median or far below?
NTILE(n) divides lines into lines n buckets are almost equal. PERCENT_RANK() expresses the level of each row as a value between 0 and 1. CUME_DIST() tells you which part of the rows has a value less than or equal to the current row. Again PERCENTILE_CONT() Calculates the actual value at a given percentile threshold – useful if you want to filter based on a variable cutoff rather than rank within a result set.
// Example: Identifying High Percent Fraud
Here's one for Google and Netflix. They want you to identify the most suspicious claims in each state. It is assumed that the top 5% of claims in each state may be fraudulent.
Data: The table is named fraud_score.
| policy_number | situation | claim_cost | fraud_school |
|---|---|---|---|
| ABCD1001 | CA | 4113 | 0.61 |
| ABCD1002 | CA | 3946 | 0.16 |
| ABCD1003 | CA | 4335 | 0.01 |
| … | … | … | … |
| ABCD1400 | TX | 3922 | 0.59 |
Code: In the code, PERCENTILE_CONT(0.95) It calculates the combined value at the 95th percentile of fraud scores within each state.
In the following SELECT statement, the CTE is combined with the original table so that every claim can be compared to its condition limit. Claims above that amount are cut off.
WITH state_percentiles AS (
SELECT state,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY fraud_score) AS p95
FROM fraud_score
GROUP BY state)
SELECT f.policy_num,
f.state,
f.claim_cost,
f.fraud_score
FROM fraud_score f
JOIN state_percentiles sp
ON f.state = sp.state
WHERE f.fraud_score >= sp.p95;
Output:
| policy_number | situation | claim_cost | fraud_school |
|---|---|---|---|
| ABCD1016 | CA | 1639 | 0.96 |
| ABCD1021 | CA | 4898 | 0.95 |
| ABCD1027 | CA | 2663 | 0.99 |
| … | … | … | … |
| ABCD1398 | TX | 3191 | 0.98 |
# The conclusion
These four patterns share a common philosophy: perform operations on the database, in one pass where possible, using the full expressive power of the SQL window definition.
What makes window functions truly powerful is not a single function in isolation. Consistency: you can combine CTEs, use multiple window functions at the same time SELECTand develop a sophisticated analytical logic that is almost identical to the definition of the business problem itself.
Nate Rosidi he is a data scientist and product strategist. He is also an adjunct professor of statistics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the job market, provides interview advice, shares data science projects, and covers all things SQL.



