Practical SQL Tricks Every Data Scientist Should Know

# Introduction

Knowing only SELECT, WHERE, and GROUP BY is enough to cover the basics, but many real-world analysis tasks require patterns that go beyond simple queries. Examples include finding streaks of consecutive activity, segmenting customers by spending tier, smoothing noisy time series data, or tracking plan upgrades across a customer base.

This article goes through 7 effective SQL patterns beyond the basics, focusing on techniques that solve real-world analytical problems.

# Setting Up the Dataset

We'll use a sample customer activity table from a fictional subscription software as a service (SaaS) company:

CREATE TABLE transactions (
    transaction_id   SERIAL PRIMARY KEY,
    customer_id      INT,
    plan_type        VARCHAR(20),   -- 'starter', 'pro', 'enterprise'
    amount           NUMERIC(10,2),
    status           VARCHAR(20),   -- 'completed', 'refunded', 'failed'
    created_at       TIMESTAMP
);

The complete dataset of 36 transactions across 7 customers, from September 2023 to June 2024, is available in seed.sql. Load it before you move on to the queries.

# 1. Calculating the Time Between Events with `LAG()`

LAG() and LEAD() let you access the value from the previous or next row without a self-join. They are especially useful for calculating gaps between events such as renewal cadence, churn signals, and re-engagement delays.

Task: Calculate how many days have passed between consecutive completed transactions for each customer.

SELECT
    customer_id,
    created_at,
    LAG(created_at) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
    ) AS previous_transaction_at,
    ROUND(
        EXTRACT(EPOCH FROM (
            created_at - LAG(created_at) OVER (
                PARTITION BY customer_id
                ORDER BY created_at
            )
        )) / 86400
    ) AS days_since_last
FROM transactions
WHERE status = 'completed'
ORDER BY customer_id, created_at;

Output (truncated):

customer_id |     created_at      | previous_transaction_at | days_since_last
-------------+---------------------+-------------------------+-----------------
        3317 | 2024-01-03 11:02:00 |                         |
        3317 | 2024-03-15 10:45:00 | 2024-01-03 11:02:00     |              72
        3317 | 2024-05-22 09:30:00 | 2024-03-15 10:45:00     |              68
        4482 | 2023-09-10 09:00:00 |                         |
        4482 | 2023-10-10 09:00:00 | 2023-09-10 09:00:00     |              30
        4482 | 2023-11-10 09:14:00 | 2023-10-10 09:00:00     |              31
        4482 | 2024-01-03 09:14:00 | 2023-11-10 09:14:00     |              54
        4482 | 2024-03-03 08:20:00 | 2024-01-03 09:14:00     |              60
        4482 | 2024-04-03 10:00:00 | 2024-03-03 08:20:00     |              31
        4482 | 2024-05-01 11:00:00 | 2024-04-03 10:00:00     |              28
        ...
        7891 | 2024-02-01 09:00:00 |                         |
        7891 | 2024-04-01 09:00:00 | 2024-02-01 09:00:00     |              60
        7891 | 2024-05-15 09:00:00 | 2024-04-01 09:00:00     |              44
        8810 | 2024-01-05 12:00:00 |                         |
        8810 | 2024-02-05 12:00:00 | 2024-01-05 12:00:00     |              31
        8810 | 2024-04-05 12:00:00 | 2024-02-05 12:00:00     |              60
(29 rows)

The first row for every customer has NULL in both columns, because there is no previous transaction to compare against. EXTRACT(EPOCH ...) converts the timestamp interval to seconds; dividing by 86400 gives days.

LEAD() works the same way but looks forward instead of backward, making it useful for calculating time to the next renewal or flagging the last transaction before churn.
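To see LAG() and LEAD() side by side, here is a minimal sketch that runs the same pattern on an in-memory SQLite database through Python's sqlite3 module. The four rows are made up for illustration and are not taken from seed.sql. SQLite has no EXTRACT(EPOCH ...), so the sketch assumes julianday() as a stand-in, whose differences are already expressed in days.

```python
import sqlite3

# Hypothetical toy rows (customer_id, created_at); not taken from seed.sql.
rows = [
    (1, "2024-01-01 09:00:00"),
    (1, "2024-01-31 09:00:00"),
    (1, "2024-03-01 09:00:00"),
    (2, "2024-02-10 12:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INT, created_at TEXT)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

query = """
SELECT
    customer_id,
    created_at,
    -- julianday() differences are in days, so no division by 86400 is needed.
    CAST(ROUND(julianday(created_at) - julianday(LAG(created_at) OVER w)) AS INTEGER)
        AS days_since_last,
    CAST(ROUND(julianday(LEAD(created_at) OVER w) - julianday(created_at)) AS INTEGER)
        AS days_until_next
FROM transactions
WINDOW w AS (PARTITION BY customer_id ORDER BY created_at)
ORDER BY customer_id, created_at
"""

gaps = conn.execute(query).fetchall()
for row in gaps:
    print(row)
```

Each customer's first row has no days_since_last and each customer's last row has no days_until_next, which mirrors the NULL behavior described above.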

# 2. Comparing Rows with Other Rows in the Same Table Using a Self-Join

A self-join relates rows within the same table to each other. It's the right tool when you need to compare two related business events over time: upgrades, downgrades, reactivations, or any before/after pattern.

Task: Find customers who upgraded from starter to pro at any time. The same pattern works for pro to enterprise.

SELECT DISTINCT t1.customer_id
FROM transactions t1
JOIN transactions t2
    ON  t1.customer_id = t2.customer_id
    AND t1.plan_type = 'starter'
    AND t2.plan_type = 'pro'
    AND t2.created_at > t1.created_at
WHERE t1.status = 'completed'
  AND t2.status = 'completed'
ORDER BY t1.customer_id;

Output:

customer_id
-------------
        4482
        6204
        7891
(3 rows)

The table is aliased twice (t1, t2) so that each alias can represent a different point in time for the same customer. The condition t2.created_at > t1.created_at enforces temporal order; without it, you would match customers who held both plan types in any order, including the wrong one. DISTINCT collapses cases where a customer has multiple starter transactions before the upgrade, which would otherwise generate duplicate rows.

The same structure works for finding downgrades, finding customers who churned and came back, or comparing any two events that need to be ordered in time.
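As a rough illustration of reusing the structure, the sketch below flips the plan types to detect downgrades (pro first, starter later). It runs on an in-memory SQLite database via Python's sqlite3 module, and the six rows are hypothetical, chosen so that only one customer qualifies.

```python
import sqlite3

# Hypothetical toy rows (customer_id, plan_type, status, created_at).
rows = [
    (1, "pro", "completed", "2024-01-05"),
    (1, "starter", "completed", "2024-03-05"),  # downgrade: pro, then starter
    (2, "starter", "completed", "2024-01-10"),
    (2, "pro", "completed", "2024-02-10"),      # upgrade only, must not match
    (3, "pro", "completed", "2024-01-15"),
    (3, "starter", "failed", "2024-02-15"),     # failed payment, must not match
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(customer_id INT, plan_type TEXT, status TEXT, created_at TEXT)"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

query = """
SELECT DISTINCT t1.customer_id
FROM transactions t1
JOIN transactions t2
    ON  t1.customer_id = t2.customer_id
    AND t1.plan_type = 'pro'          -- earlier event
    AND t2.plan_type = 'starter'      -- later event
    AND t2.created_at > t1.created_at -- enforce temporal order
WHERE t1.status = 'completed'
  AND t2.status = 'completed'
ORDER BY t1.customer_id
"""

downgraded = [row[0] for row in conn.execute(query)]
print(downgraded)
```

Customer 2 holds both plan types but in the upgrade order, so the created_at comparison excludes them; customer 3 is excluded by the status filter.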

# 3. Selecting the Top Row for Each Group with `ROW_NUMBER()`

When you need the top-N rows per category (highest transaction per customer, most recent event per account, first purchase per cohort), ROW_NUMBER() inside a common table expression (CTE) is the standard approach.

Task: Find each customer's highest-value completed transaction.

WITH ranked AS (
    SELECT
        customer_id,
        transaction_id,
        amount,
        plan_type,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY amount DESC, created_at DESC
        ) AS rn
    FROM transactions
    WHERE status = 'completed'
)
SELECT customer_id, transaction_id, amount, plan_type
FROM ranked
WHERE rn = 1
ORDER BY customer_id;

Output:

 customer_id | transaction_id | amount | plan_type
-------------+----------------+--------+------------
        3317 |             12 |  19.00 | starter
        4482 |              8 | 299.00 | enterprise
        5901 |             19 | 299.00 | enterprise
        6103 |             25 | 299.00 | enterprise
        6204 |             28 |  79.00 | pro
        7891 |             32 |  79.00 | pro
        8810 |             36 |  79.00 | pro
(7 rows)

ROW_NUMBER() assigns 1 to the first row in sort order within each partition. The outer query then keeps only those rows. The second sort key, created_at DESC, serves as a tiebreaker; when two transactions have the same amount, the most recent one wins.

If you want ties included rather than broken, swap ROW_NUMBER() for RANK(). RANK() assigns the same number to tied rows and skips the next rank (1, 1, 3), while DENSE_RANK() does the same without skipping (1, 1, 2).
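The difference between the three ranking functions is easiest to see on a tie. This is a small sketch on an in-memory SQLite database (via Python's sqlite3), using three made-up transactions where two share the top amount. The transaction_id tiebreaker in ROW_NUMBER() is an assumption added here so the numbering is deterministic.

```python
import sqlite3

# Hypothetical toy rows (transaction_id, amount); two rows tie at 79.00.
rows = [(1, 79.0), (2, 79.0), (3, 19.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (transaction_id INT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

query = """
SELECT
    transaction_id,
    amount,
    ROW_NUMBER() OVER (ORDER BY amount DESC, transaction_id) AS row_num,
    RANK()       OVER (ORDER BY amount DESC)                 AS rnk,
    DENSE_RANK() OVER (ORDER BY amount DESC)                 AS dense_rnk
FROM transactions
ORDER BY amount DESC, transaction_id
"""

ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

Filtering on rnk = 1 would return both tied rows, while row_num = 1 returns exactly one.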

# 4. Segmenting Customers by Spending with `NTILE(n)`

NTILE(n) divides sorted rows into n approximately equal buckets and assigns each row a bucket number. It's a good tool for scoring customers, building quartile breakdowns, or creating cohorts for A/B analysis without hard-coded thresholds.

Task: Rank customers into spending quartiles based on their total completed transaction amount.

WITH customer_spend AS (
    SELECT
        customer_id,
        SUM(amount) AS total_spend,
        COUNT(*) AS total_transactions
    FROM transactions
    WHERE status = 'completed'
    GROUP BY customer_id
)
SELECT
    customer_id,
    total_spend,
    total_transactions,
    NTILE(4) OVER (ORDER BY total_spend) AS spend_quartile
FROM customer_spend
ORDER BY total_spend DESC;

Output:

customer_id | total_spend | total_transactions | spend_quartile
-------------+-------------+--------------------+----------------
        5901 |     1495.00 |                  5 |              4
        6103 |      835.00 |                  5 |              3
        4482 |      653.00 |                  7 |              3
        8810 |      237.00 |                  3 |              2
        6204 |      177.00 |                  3 |              2
        7891 |      177.00 |                  3 |              1
        3317 |       57.00 |                  3 |              1
(7 rows)

Quartile 4 contains your highest spenders; quartile 1 contains your lowest. NTILE() doesn't use fixed thresholds, so buckets rescale automatically as new customers are added. This makes it more robust than static cutoffs like CASE WHEN total_spend > 500.
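To make the bucket assignment concrete, the sketch below feeds the seven totals from the output table above into NTILE(4) on an in-memory SQLite database (via Python's sqlite3). The customer_id DESC tiebreaker is an assumption added here: two customers tie at 177.00, and without a tiebreaker the database is free to order them either way.

```python
import sqlite3

# (customer_id, total_spend) pairs copied from the output table above.
customer_spend = [
    (5901, 1495.00), (6103, 835.00), (4482, 653.00), (8810, 237.00),
    (6204, 177.00), (7891, 177.00), (3317, 57.00),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_spend (customer_id INT, total_spend REAL)")
conn.executemany("INSERT INTO customer_spend VALUES (?, ?)", customer_spend)

query = """
SELECT
    customer_id,
    total_spend,
    -- Tiebreaker (assumption) makes the split of tied totals deterministic.
    NTILE(4) OVER (ORDER BY total_spend, customer_id DESC) AS spend_quartile
FROM customer_spend
ORDER BY total_spend DESC, customer_id
"""

quartiles = {cid: q for cid, _, q in conn.execute(query)}
print(quartiles)
```

With 7 rows and 4 buckets, the first three buckets get two rows each and the last gets one. NTILE() splits by row position rather than by value, which is why the two customers tied at 177.00 land in different quartiles.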

# 5. Smoothing Noisy Data with a Rolling Window

A rolling (or moving) average smooths out month-to-month volatility, making trends in time series data easier to read. A window function with an explicit ROWS BETWEEN frame gives you precise control over how many periods to include.

Task: Calculate a 3-month rolling average of monthly revenue to reduce noise.

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', created_at)::DATE AS month,
        SUM(amount) AS monthly_revenue
    FROM transactions
    WHERE status = 'completed'
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    monthly_revenue,
    ROUND(AVG(monthly_revenue) OVER (
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) AS revenue_3mo_avg
FROM monthly
ORDER BY month;

Output:

    month    | monthly_revenue | revenue_3mo_avg
-------------+-----------------+-----------------
 2023-09-01  |           19.00 |           19.00
 2023-10-01  |           19.00 |           19.00
 2023-11-01  |           79.00 |           39.00
 2024-01-01  |          275.00 |          124.33
 2024-02-01  |          476.00 |          276.67
 2024-03-01  |          555.00 |          435.33
 2024-04-01  |          835.00 |          622.00
 2024-05-01  |          775.00 |          721.67
 2024-06-01  |          598.00 |          736.00
(9 rows)

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW tells the window function to look at the current row and the two rows before it. The first two rows use fewer inputs since there is no earlier history, so they work as 1-month and 2-month averages respectively.
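The frame behavior can be checked directly. This sketch loads the monthly revenue figures from the output table above into an in-memory SQLite database (via Python's sqlite3) and adds a COUNT(*) over the same frame, an extra column introduced here only to show how many rows feed each average.

```python
import sqlite3

# (month, monthly_revenue) pairs copied from the output table above.
monthly = [
    ("2023-09-01", 19.0), ("2023-10-01", 19.0), ("2023-11-01", 79.0),
    ("2024-01-01", 275.0), ("2024-02-01", 476.0), ("2024-03-01", 555.0),
    ("2024-04-01", 835.0), ("2024-05-01", 775.0), ("2024-06-01", 598.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (month TEXT, monthly_revenue REAL)")
conn.executemany("INSERT INTO monthly VALUES (?, ?)", monthly)

query = """
SELECT
    month,
    COUNT(*) OVER w                       AS rows_in_frame,
    ROUND(AVG(monthly_revenue) OVER w, 2) AS revenue_3mo_avg
FROM monthly
WINDOW w AS (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY month
"""

rolling = conn.execute(query).fetchall()
for row in rolling:
    print(row)
```

Note that ROWS counts rows, not calendar months. December 2023 has no row in this data, so the January 2024 average is built from October, November, and January.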

Swap ROWS for RANGE if you want to include all rows that share the same ORDER BY value (useful when multiple rows share a timestamp). For heavier smoothing, change 2 PRECEDING to 5 PRECEDING for a 6-month window.
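The ROWS versus RANGE distinction only shows up when the ORDER BY column has duplicates. The sketch below uses a running total (UNBOUNDED PRECEDING) rather than the 2 PRECEDING frame, since offset-based RANGE frames behave differently across engines; the table name payments and its three rows are hypothetical. It runs on in-memory SQLite via Python's sqlite3.

```python
import sqlite3

# Hypothetical toy rows (day, amount); two payments share the same day.
rows = [("2024-01-01", 10.0), ("2024-01-01", 10.0), ("2024-01-02", 30.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (day TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)

query = """
SELECT
    day,
    amount,
    -- ROWS: frame ends at the current physical row.
    SUM(amount) OVER (
        ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_rows,
    -- RANGE: frame ends at the last row sharing the current ORDER BY value.
    SUM(amount) OVER (
        ORDER BY day RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_range
FROM payments
"""

# Sort in Python so the printed order does not depend on how ties are processed.
totals = sorted(conn.execute(query).fetchall())
for row in totals:
    print(row)
```

With ROWS, the two same-day payments get different running totals (10 and 20). With RANGE, both see the full day's total (20), because peers in the ORDER BY are included together.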

# 6. Conditional Aggregation with `FILTER`

FILTER lets you apply a WHERE condition to a single aggregate without splitting the query into multiple subqueries. The result is several conditional aggregates in one pass over the data.

Task: Get completed revenue, refunded revenue, and failed transaction counts broken down by month, all in one row per month.

SELECT
    DATE_TRUNC('month', created_at) AS month,
    SUM(amount) FILTER (WHERE status = 'completed') AS revenue_completed,
    SUM(amount) FILTER (WHERE status = 'refunded')  AS revenue_refunded,
    COUNT(*)    FILTER (WHERE status = 'failed')    AS failed_count
FROM transactions
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;

Output:

         month          | revenue_completed | revenue_refunded | failed_count
------------------------+-------------------+------------------+--------------
 2023-09-01 00:00:00+00 |             19.00 |                  |            0
 2023-10-01 00:00:00+00 |             19.00 |                  |            0
 2023-11-01 00:00:00+00 |             79.00 |                  |            0
 2024-01-01 00:00:00+00 |            275.00 |                  |            0
 2024-02-01 00:00:00+00 |            476.00 |            79.00 |            1
 2024-03-01 00:00:00+00 |            555.00 |            79.00 |            0
 2024-04-01 00:00:00+00 |            835.00 |           299.00 |            0
 2024-05-01 00:00:00+00 |            775.00 |                  |            1
 2024-06-01 00:00:00+00 |            598.00 |                  |            2
(9 rows)

The alternative to FILTER is three separate queries joined together: more code, less readable, and generally slower. Note that SUM with FILTER returns NULL (not zero) if there are no matching rows in a given month, which is accurate: there were no refunds in those months. Wrap it in COALESCE(..., 0) if you prefer zeros.

FILTER is standard SQL and works in PostgreSQL, though not every engine supports it. For Snowflake and others without it, use SUM(CASE WHEN status = 'completed' THEN amount END) instead.
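Here is a minimal sketch comparing the FILTER form, the CASE WHEN fallback, and the COALESCE wrapper side by side. It runs on an in-memory SQLite database via Python's sqlite3 and assumes SQLite 3.30 or newer, where FILTER on aggregates is available. The four rows and the simplified month column are made up for illustration.

```python
import sqlite3

# Hypothetical toy rows (month, status, amount).
rows = [
    ("2024-02", "completed", 79.0),
    ("2024-02", "refunded", 79.0),
    ("2024-02", "failed", 19.0),
    ("2024-03", "completed", 299.0),  # no refunds or failures this month
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (month TEXT, status TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

query = """
SELECT
    month,
    SUM(amount) FILTER (WHERE status = 'refunded')              AS refunded_filter,
    SUM(CASE WHEN status = 'refunded' THEN amount END)          AS refunded_case,
    COALESCE(SUM(amount) FILTER (WHERE status = 'refunded'), 0) AS refunded_or_zero,
    COUNT(*) FILTER (WHERE status = 'failed')                   AS failed_filter,
    COUNT(CASE WHEN status = 'failed' THEN 1 END)               AS failed_case
FROM transactions
GROUP BY month
ORDER BY month
"""

summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

The FILTER and CASE WHEN columns agree in every row. For the month with no refunds, both SUM forms return NULL (None in Python), the COALESCE version returns 0, and COUNT returns 0 either way.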

# 7. Finding Consecutive Streaks with Window Functions

Finding unbroken sequences (active months without a gap, consecutive days with transactions, subscription streaks) is one of SQL's trickier problems. A classic solution uses window functions to group rows into streaks without a recursive CTE.

How it works: assign each active month a consecutive row number within the customer partition. If the months are truly consecutive, subtracting that row number (in months) from the month produces the same value for every month in the streak. A gap always breaks the pattern. The query below uses an equivalent variant: it compares each month with the previous one via LAG() and starts a new streak_id whenever the gap is not exactly one month.

Task: Find each customer's streaks of consecutive active months (months with at least one completed transaction).

WITH monthly_activity AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', created_at)::DATE AS active_month
    FROM transactions
    WHERE status = 'completed'
    GROUP BY customer_id, DATE_TRUNC('month', created_at)
),
with_prev AS (
    SELECT
        customer_id,
        active_month,
        LAG(active_month) OVER (
            PARTITION BY customer_id
            ORDER BY active_month
        ) AS prev_month
    FROM monthly_activity
),
streak_groups AS (
    SELECT
        customer_id,
        active_month,
        SUM(CASE WHEN active_month = prev_month + INTERVAL '1 month' THEN 0 ELSE 1 END)
            OVER (PARTITION BY customer_id ORDER BY active_month) AS streak_id
    FROM with_prev
),
streaks AS (
    SELECT
        customer_id,
        streak_id,
        MIN(active_month) AS streak_start,
        MAX(active_month) AS streak_end,
        COUNT(*) AS streak_length_months
    FROM streak_groups
    GROUP BY customer_id, streak_id
)
SELECT customer_id, streak_start, streak_end, streak_length_months
FROM streaks
ORDER BY customer_id, streak_start;

Output:

customer_id | streak_start | streak_end | streak_length_months
-------------+--------------+------------+----------------------
        3317 | 2024-01-01   | 2024-01-01 |                    1
        3317 | 2024-03-01   | 2024-03-01 |                    1
        3317 | 2024-05-01   | 2024-05-01 |                    1
        4482 | 2023-09-01   | 2023-11-01 |                    3
        4482 | 2024-01-01   | 2024-01-01 |                    1
        4482 | 2024-03-01   | 2024-05-01 |                    3
        5901 | 2024-02-01   | 2024-06-01 |                    5
        6103 | 2024-01-01   | 2024-04-01 |                    4
        6103 | 2024-06-01   | 2024-06-01 |                    1
        6204 | 2024-01-01   | 2024-01-01 |                    1
        6204 | 2024-03-01   | 2024-03-01 |                    1
        6204 | 2024-05-01   | 2024-05-01 |                    1
        7891 | 2024-02-01   | 2024-02-01 |                    1
        7891 | 2024-04-01   | 2024-05-01 |                    2
        8810 | 2024-01-01   | 2024-02-01 |                    2
        8810 | 2024-04-01   | 2024-04-01 |                    1
(16 rows)
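The row-number subtraction described at the start of this section can be sketched as follows. The example runs on an in-memory SQLite database via Python's sqlite3 and uses the active months of customer 4482 as implied by the output above. The month index (year times 12 plus month) and the streak_key name are assumptions introduced here to do the subtraction with plain integers.

```python
import sqlite3

# Active months for customer 4482, as implied by the streaks in the output above.
active_months = [
    "2023-09-01", "2023-10-01", "2023-11-01",
    "2024-01-01",
    "2024-03-01", "2024-04-01", "2024-05-01",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_activity (customer_id INT, active_month TEXT)")
conn.executemany(
    "INSERT INTO monthly_activity VALUES (4482, ?)",
    [(m,) for m in active_months],
)

query = """
WITH numbered AS (
    SELECT
        customer_id,
        active_month,
        -- Month index minus row number stays constant inside a streak.
        CAST(strftime('%Y', active_month) AS INTEGER) * 12
          + CAST(strftime('%m', active_month) AS INTEGER)
          - ROW_NUMBER() OVER (
                PARTITION BY customer_id ORDER BY active_month
            ) AS streak_key
    FROM monthly_activity
)
SELECT
    customer_id,
    MIN(active_month) AS streak_start,
    MAX(active_month) AS streak_end,
    COUNT(*)          AS streak_length_months
FROM numbered
GROUP BY customer_id, streak_key
ORDER BY customer_id, streak_start
"""

streaks = conn.execute(query).fetchall()
for row in streaks:
    print(row)
```

Both variants produce the same streaks for this customer. The LAG() version reads more explicitly, while the subtraction version needs one fewer CTE.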

# Quick Reference

These patterns rely on standard SQL rather than database-specific features, and they appear frequently in analytical workflows such as retention analysis, funnel tracking, and revenue reporting.

| Technique | When to Use It |
| --- | --- |
| `LAG()` / `LEAD()` | Time between events, before/after comparison for each entity |
| Self-join | Detect transitions between states (upgrades, reactivations) |
| `ROW_NUMBER()` | Top-N rows per group, deduplication |
| `NTILE(n)` | Customer segmentation into spending/activity tiers |
| Rolling window (`ROWS BETWEEN`) | Smoothing noisy time series, moving averages |
| `FILTER` | Multiple conditional aggregates in one query pass |
| Consecutive streak detection | Subscription tracking, retention analysis, session gaps |

Once you're comfortable with them, many of the multi-step data transformations typically handled in Python can be expressed cleanly and efficiently in a single SQL query.

Bala Priya C is an engineer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the engineering community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and code tutorials.
