ANI

SQL Window Functions Beyond the Basics: Solving Real Business Problems

# Introduction

Most of you use SQL window functions, but you're only scratching the surface — a ROW_NUMBER() here, a SUM() OVER() There. The power of window functions is revealed when you apply them to difficult problems. I'll walk you through four patterns that show window functions when they're most useful.

SQL Window Functions

The examples are all real interview questions that you can practice on StrataScratch.

# Running Rates

Calculating effective rates is one of the most commonly used business window functions. Finance people absolutely love it! It is used to track monthly cumulative income, which then easily feeds into calculating where you are against your annual income goal.

SQL Window Functions

What makes this a problem for windowing is that, in general, you have to enter both the period value and the cumulative total in the same output. You can not use GROUP BY with SUM()because that wraps individual lines. So, the obvious solution is to use the window function, that is, SUM() OVER().

// Example: Calculating Income Over Time

This Amazon question initially asks you to calculate a 3-month average. However, we will not ignore that and calculate the cumulative income for each month.

Data: Here it is amazon_purchases table preview.

User ID created_to purchase_amt
10 2020-01-01 3742
11 2020-01-04 1290
12 2020-01-07 4249
109 2020-10-24 1749

Code: An inner query converts dates into YYYY-MM formatting using TO_CHAR() and includes monthly income, filter benefits by WHERE purchase_amt > 0.

The outer query uses a window function over those total values ​​for the calculated month. I am not specifying a clear (on purpose) framing clause in it OVER()so the window function is automatic RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That means the window is all the rows before the current row, that is, the month. In other words, the cumulative total is: all previous months + current month. No wonder that is something textbook definition of accumulated sum.

SELECT t.month,
       t.monthly_revenue,
       SUM(t.monthly_revenue) OVER(ORDER BY t.month) AS cumulative_revenue
FROM (
    SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
        SUM(purchase_amt) AS monthly_revenue
    FROM amazon_purchases
    WHERE purchase_amt > 0
    GROUP BY TO_CHAR(created_at::date, 'YYYY-MM')
    ORDER BY TO_CHAR(created_at::date, 'YYYY-MM')
) t
ORDER BY t.month ASC;

Output:

month monthly_income net_income
2020-01 26292 26292
2020-02 20695 46987
2020-03 29620 76607
2020-10 15310 239869

# Gaps and Islands (Session)

This pattern, too, involves sequential data, just like using values, but uses different window functions.

An An island is a series of parallel lineseg, consecutive daily logins. A the gap between the islands.

One of the most common real-world applications of this pattern sessionization – grouping raw event streams into sessions. Time is usually defined as a a sequence of events from the same user where no gap between consecutive events exceeds a certain timeout (30 minutes is standard for web analytics).

A session is usually used to product and data engineering. It is used wherever you need to group a stream of raw events into important units of work.

SQL Window Functions

Classic retrieval in SQL consists of two steps:

  • LAG() or LEAD() — to compare each line with the one before or after it, and flag where a new line begins.
  • SUM(flag) OVER (PARTITION BY user ORDER BY date) — to accumulate flags in the ID of the streak, as it remains flat within the streak and increases in all boundaries.

// Example: Finding User Routes

The question from LinkedIn and Meta discussions asks you to find the top three users with the longest series of visits to the forum until August 10, 2022. You must exclude all users with the longest series of three, if there is more than one user in each line length.

Data: The table user_streaks.

User ID date_visited
u001 2022-08-01
u001 2022-08-01
u004 2022-08-01
u005 2022-08-11

Code: The question is long, but it's well organized into CTEs, so it's easy to follow.

  1. unique_visits: Removes duplicate visit records and consolidates data on August 10, 2022.
  2. streak_flags: Using LAG() to find the previous visit date for each user and flag the row as 0 (continuation of the streak if the gap is 1 day) or 1 (new start of sequence for any other gap).
  3. streak_ids: Converts flags to streak group IDs using accumulated SUM().
  4. streak_lengths: Counts days in sequence.
  5. longest_per_user: Stores only the longest string for each user.
  6. ranked_lengths: Measures the length of a unique streak.
  7. top_lengths: Gets the maximum 3 values ​​of the streak length.

The final SELECT it covers everything: it shows all users with three high streaks and streak length in days.

WITH unique_visits AS (
    SELECT DISTINCT user_id, date_visited
    FROM   user_streaks
    WHERE  date_visited <= DATE '2022-08-10'),
streak_flags AS (
    SELECT *,
           CASE
               WHEN date_visited
                     - LAG(date_visited) OVER (PARTITION BY user_id ORDER BY date_visited) = 1
               THEN 0
               ELSE 1
           END AS new_streak
    FROM   unique_visits),
streak_ids AS (
    SELECT *,
           SUM(new_streak) OVER (PARTITION BY user_id ORDER BY date_visited) AS streak_id
    FROM   streak_flags),
streak_lengths AS (
    SELECT user_id,
           streak_id,
           COUNT(*) AS streak_length
    FROM   streak_ids
    GROUP  BY user_id, streak_id),
longest_per_user AS (
    SELECT user_id,
           MAX(streak_length) AS streak_length
    FROM   streak_lengths
    GROUP  BY user_id),
ranked_lengths AS (
    SELECT DISTINCT
           streak_length,
           DENSE_RANK() OVER (ORDER BY streak_length DESC) AS len_rank
    FROM   longest_per_user),
top_lengths AS (
    SELECT streak_length
    FROM   ranked_lengths
    WHERE  len_rank <= 3)
SELECT u.user_id,
       u.streak_length
FROM   longest_per_user u
JOIN   top_lengths       t USING (streak_length)
ORDER  BY u.streak_length DESC, u.user_id;

Output:

User ID streak_length
u004 10
u005 10
u003 5
u001 4
u006 4

# Cluster Analysis

A A group is a group of users who share an initial eventfor example, first purchase, first login, or first registration date. Analyzing cohorts is it basis for maintenance reportingas it answers the question of how many users returned after the first event.

SQL Window Functions

The main thing in cluster analysis is to find the the team's anchor in the user's activity history so that you can measure all subsequent activity against it.

Doing that in SQL boils down to three main window operations:

  1. MIN(event_time) OVER (PARTITION BY user_id) — the most common pattern where the anchor is the day.
  2. FIRST_VALUE(attribute) OVER (PARTITION BY user_id ORDER BY event_time) — used when you need the value of the pin itself, eg, the first seller or the first category of the product.
  3. ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) = 1 — is used when you want to split the first event as a separate line and merge it back into the full history rather than broadcasting to all lines.

// Example: Calculating First Orders

Here's a question for DoorDash. It requires you to count the number of orders and first orders (from the customer's point of view) that each seller has had. You should also exclude sellers who have not yet received orders.

Data: The first table is named order_details.

id customer_identity vendor_identity order_timestamp n_things earned_value
8 1049 6 2022-01-14 01:00:28 5 16.3
7 1049 5 2022-01-14 11:50:29 4 2.16
22 1049 1 2022-01-14 22:46:54 8 2.63
39 1060 1 2022-01-16 22:27:30 11 15.41

The second table says merchant_details.

id name category zip code
1 Treehouse pizza an American 92507
2 Thai lion asian 90017
3 Raven for dinner fast food 95204
7 The taste of Gyros the mediterranean 94789

Code: The first CTE is where group mentality occurs. I use the FIRST_VALUE() window function to paste the seller from each customer's previous order to all lines in their order history. The result is a table where each order carries a label of which vendor the customer started with.

In the second CTE, I join the labels back to the full order history using LEFT JOIN to ensure that sellers who received orders but never became anyone's first seller still appear in the result. We use COUNT() again DISTINCT counting only the customers for whom that salesperson was the first — that is the size of your group. It's one COUNT()you get the total number of orders. DISTINCT is needed here, too, because LEFT JOIN with first_order can generate duplicate order lines – since first_order it stores one row per order (not per customer), one order at a time order_details can match multiple lines in between first_order to the same customer, increasing the price without it.

In the finals SELECTwe join the number_of_customers CTE with merchant_details to deliver vendor names.

WITH first_order AS  (
SELECT customer_id,
       FIRST_VALUE(merchant_id) OVER(PARTITION BY customer_id ORDER BY order_timestamp) AS first_merchant
FROM order_details),
number_of_customers AS  (
SELECT merchant_id,
       COUNT(DISTINCT f.customer_id) AS first_time_orders,
       COUNT(DISTINCT id) AS total_number_of_orders
FROM order_details d
LEFT JOIN first_order f ON d.merchant_id = f.first_merchant
GROUP BY 1)
SELECT name,
       total_number_of_orders,
       first_time_orders
FROM number_of_customers
JOIN merchant_details ON number_of_customers.merchant_id = merchant_details.id;

Output:

name total_number_of_orders first_time_orders
Treehouse pizza 8 1
Thai lion 14 7
Raven for dinner 12 0
Burger A1 4 0
Sushi Bay 7 3
Tacos You 7 1

# Percentages and Rate Analysis

Combined functions tell you the ratio. Window-based level functions tell you the distribution, and the distribution is where the business questions of interest reside. Is your 90th percentile order value unusually high, suggesting that a few large buyers are stealing revenue? Are the bottom 25% of salespeople clustered near the median or far below?

NTILE(n) divides lines into lines n buckets are almost equal. PERCENT_RANK() expresses the level of each row as a value between 0 and 1. CUME_DIST() tells you which part of the rows has a value less than or equal to the current row. Again PERCENTILE_CONT() Calculates the actual value at a given percentile threshold – useful if you want to filter based on a variable cutoff rather than rank within a result set.

SQL Window Functions

// Example: Identifying High Percent Fraud

Here's one for Google and Netflix. They want you to identify the most suspicious claims in each state. It is assumed that the top 5% of claims in each state may be fraudulent.

Data: The table is named fraud_score.

policy_number situation claim_cost fraud_school
ABCD1001 CA 4113 0.61
ABCD1002 CA 3946 0.16
ABCD1003 CA 4335 0.01
ABCD1400 TX 3922 0.59

Code: In the code, PERCENTILE_CONT(0.95) It calculates the combined value at the 95th percentile of fraud scores within each state.

In the following SELECT statement, the CTE is combined with the original table so that every claim can be compared to its condition limit. Claims above that amount are cut off.

WITH state_percentiles AS (
    SELECT state,
           PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY fraud_score) AS p95
    FROM fraud_score
    GROUP BY state)
SELECT f.policy_num,
       f.state,
       f.claim_cost,
       f.fraud_score
FROM fraud_score f
JOIN state_percentiles sp
ON f.state = sp.state
WHERE f.fraud_score >= sp.p95;

Output:

policy_number situation claim_cost fraud_school
ABCD1016 CA 1639 0.96
ABCD1021 CA 4898 0.95
ABCD1027 CA 2663 0.99
ABCD1398 TX 3191 0.98

# The conclusion

These four patterns share a common philosophy: perform operations on the database, in one pass where possible, using the full expressive power of the SQL window definition.

What makes window functions truly powerful is not a single function in isolation. Consistency: you can combine CTEs, use multiple window functions at the same time SELECTand develop a sophisticated analytical logic that is almost identical to the definition of the business problem itself.

Nate Rosidi he is a data scientist and product strategist. He is also an adjunct professor of statistics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the job market, provides interview advice, shares data science projects, and covers all things SQL.

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button