
Most Candidates Fail These SQL Concepts in Data Interviews


The interviewer's job is to find the best candidate for the advertised position. To do so, they will happily set up SQL interview questions designed to see if they can catch you off guard. There are several SQL concepts where candidates often stumble.

I hope you will be one of those who avoid that fate, as I explain these concepts in detail and finish with examples of how to solve some problems correctly.


1. Window Functions

Why It's Hard: Candidates memorize what each window function does but don't really understand how window frames, partitions, or ordering work.

Common Mistake: Not specifying ORDER BY in ranking window functions or in value window functions, such as LEAD() or LAG(), and expecting the query to work or the result to be deterministic.

Example: In this example, you need to find users who made a second purchase within 7 days of any previous purchase.

You might write this query.

WITH ordered_tx AS (
  SELECT user_id,
         created_at::date AS tx_date,
         LAG(created_at::DATE) OVER (PARTITION BY user_id) AS prev_tx_date
  FROM amazon_transactions
)

SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;

At first glance, everything might seem fine. The code even outputs something that could pass for the correct answer.


First, we are lucky that the code works at all! This happens because I wrote it in PostgreSQL. In some other SQL flavors, you would get an error, because ORDER BY is mandatory in ranking and value window functions.

Second, the output is wrong; it contains some rows that should not be there. Why do they appear, then?

They appear because we did not specify the ORDER BY clause in the LAG() window function. Without it, the row order is arbitrary. So we are comparing the current transaction with some arbitrary other row for that user, not with the one that occurred immediately before it.

This is not what the question asks. We need to compare each transaction with the previous one by date. In other words, we need to state this explicitly in the ORDER BY clause within the LAG() function.

WITH ordered_tx AS (
  SELECT user_id,
         created_at::date AS tx_date,
         LAG(created_at::DATE) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_tx_date
  FROM amazon_transactions
)

SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;
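
To try the corrected query without a PostgreSQL instance, here is a minimal sketch that runs an equivalent query through Python's built-in sqlite3 module. The sample rows are invented for illustration, and the PostgreSQL date cast is swapped for SQLite's julianday(), which is an adaptation rather than the original syntax. It assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; user 1's rows are deliberately inserted out of order.
conn.executescript("""
CREATE TABLE amazon_transactions (user_id INTEGER, created_at TEXT);
INSERT INTO amazon_transactions VALUES
  (1, '2020-01-20'),
  (1, '2020-01-01'),
  (1, '2020-02-15'),
  (2, '2020-03-01'),
  (2, '2020-03-05');
""")

# Same logic as the corrected query: LAG() is ordered by created_at,
# so each row is compared with the transaction right before it in time.
query = """
WITH ordered_tx AS (
  SELECT user_id,
         julianday(created_at) AS tx_day,
         LAG(julianday(created_at)) OVER (
           PARTITION BY user_id ORDER BY created_at
         ) AS prev_tx_day
  FROM amazon_transactions
)
SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_day IS NOT NULL AND tx_day - prev_tx_day <= 7
ORDER BY user_id;
"""

returning_users = [row[0] for row in conn.execute(query)]
print(returning_users)
```

With this data, only user 2 has two purchases within 7 days. User 1's gaps are 19 and 26 days, but if the rows were compared in insertion order, the pair (2020-01-20, 2020-01-01) would give a negative difference that still passes the `<= 7` filter, which is the kind of spurious row an unordered LAG() can produce.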

2. Filtering With Aggregates (Especially HAVING vs. WHERE)

Why It's Hard: People often misunderstand the SQL order of execution, which is: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY. This order means that WHERE filters rows before aggregation, and HAVING filters after it. Logically, that also means you can't use aggregate functions in the WHERE clause.

Common Mistake: Trying to use aggregate functions in WHERE in a grouped query and getting an error.

Example: This interview question asks you to find the total revenue made by each winery. Only wineries where 90 is the lowest number of points for any of their varieties should be considered.

Many will see this as an easy question and hastily write this query.

SELECT winery,
       variety,
       SUM(price) AS total_revenue
FROM winemag_p1
WHERE MIN(points) >= 90
GROUP BY winery, variety
ORDER BY winery, total_revenue DESC;

However, that code will throw an error stating that aggregate functions are not allowed in the WHERE clause. That pretty much explains everything. The solution? Move the filtering condition from WHERE to HAVING.

SELECT winery,
       variety,
       SUM(price) AS total_revenue
FROM winemag_p1
GROUP BY winery, variety
HAVING MIN(points) >= 90
ORDER BY winery, total_revenue DESC;
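
The difference between the two clauses can be checked quickly with a sketch like the one below, which runs both queries through Python's sqlite3 module. The sample rows are made up, and SQLite stands in for PostgreSQL here, so the exact error message will differ from the one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data for illustration only.
conn.executescript("""
CREATE TABLE winemag_p1 (winery TEXT, variety TEXT, price INTEGER, points INTEGER);
INSERT INTO winemag_p1 VALUES
  ('A', 'Red',   20, 92),
  ('A', 'Red',   30, 95),
  ('A', 'White', 15, 88),
  ('B', 'Red',   40, 91);
""")

# Aggregate in WHERE: rejected, because WHERE runs before grouping.
bad_query = """
SELECT winery, variety, SUM(price) AS total_revenue
FROM winemag_p1
WHERE MIN(points) >= 90
GROUP BY winery, variety;
"""
try:
    conn.execute(bad_query)
    where_failed = False
except sqlite3.Error as exc:
    where_failed = True
    print("WHERE version failed:", exc)

# Aggregate in HAVING: evaluated after GROUP BY, so it works.
good_query = """
SELECT winery, variety, SUM(price) AS total_revenue
FROM winemag_p1
GROUP BY winery, variety
HAVING MIN(points) >= 90
ORDER BY winery, total_revenue DESC;
"""
revenue = conn.execute(good_query).fetchall()
print(revenue)
```

The ('A', 'White') group is dropped because its minimum score is 88, while the other two groups survive the HAVING filter.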

3. Self-Joins for Time-Based or Event-Based Comparisons

Why It's Hard: The idea of joining a table with itself is quite unintuitive, so candidates often forget it's an option.

Common Mistake: Using subqueries and complicating the query when joining the table with itself would be simpler and faster, especially when filtering by dates or events.

Example: Here is a question that asks you to show the change in each currency's exchange rate between 1 January 2020 and 1 July 2020.

You can solve this by writing an outer correlated subquery that fetches the July 1 exchange rates and then subtracts the January 1 exchange rates, which come from the inner subquery.

SELECT jan_rates.source_currency,
  -- Correlated subquery: looks up the July 1 rate for each January row
  (SELECT exchange_rate 
   FROM sf_exchange_rate 
   WHERE source_currency = jan_rates.source_currency AND date = '2020-07-01') - jan_rates.exchange_rate AS difference
FROM (SELECT source_currency, exchange_rate
      FROM sf_exchange_rate
      WHERE date = '2020-01-01'
) AS jan_rates;

This returns the correct result, but such a solution is unnecessarily complicated. A much simpler solution, with fewer lines of code, involves joining the table with itself and then applying two date filtering conditions in the WHERE clause.

SELECT jan.source_currency,
       jul.exchange_rate - jan.exchange_rate AS difference
FROM sf_exchange_rate jan
JOIN sf_exchange_rate jul ON jan.source_currency = jul.source_currency
WHERE jan.date = '2020-01-01' AND jul.date = '2020-07-01';
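
As a sanity check that the two approaches agree, the following sketch runs both through Python's sqlite3 module on a few invented rows. The rates are chosen so the subtraction is exact in floating point, and the ORDER BY is added only to make the comparison stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; the March row should be ignored by both queries.
conn.executescript("""
CREATE TABLE sf_exchange_rate (source_currency TEXT, exchange_rate REAL, date TEXT);
INSERT INTO sf_exchange_rate VALUES
  ('EUR', 1.0,  '2020-01-01'),
  ('EUR', 1.25, '2020-07-01'),
  ('GBP', 1.5,  '2020-01-01'),
  ('GBP', 1.25, '2020-07-01'),
  ('EUR', 1.1,  '2020-03-01');
""")

# Subquery version: correlated lookup of the July rate per January row.
subquery_sql = """
SELECT jan_rates.source_currency,
  (SELECT exchange_rate
   FROM sf_exchange_rate
   WHERE source_currency = jan_rates.source_currency
     AND date = '2020-07-01') - jan_rates.exchange_rate AS difference
FROM (SELECT source_currency, exchange_rate
      FROM sf_exchange_rate
      WHERE date = '2020-01-01') AS jan_rates
ORDER BY jan_rates.source_currency;
"""

# Self-join version: the same table under two aliases, one per date.
self_join_sql = """
SELECT jan.source_currency,
       jul.exchange_rate - jan.exchange_rate AS difference
FROM sf_exchange_rate jan
JOIN sf_exchange_rate jul ON jan.source_currency = jul.source_currency
WHERE jan.date = '2020-01-01' AND jul.date = '2020-07-01'
ORDER BY jan.source_currency;
"""

subquery_result = conn.execute(subquery_sql).fetchall()
self_join_result = conn.execute(self_join_sql).fetchall()
print(self_join_result)
```

One difference worth keeping in mind: if a currency had no July 1 row, the subquery version would return it with a NULL difference, while the inner self-join would drop it.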

4. Subqueries vs. Common Table Expressions (CTEs)

Why It's Hard: People often get stuck on subqueries because they learn them before common table expressions (CTEs) and keep using them for any query with layered logic. However, subqueries can get messy very quickly.

Common Mistake: Using deeply nested SELECT statements when CTEs would be much simpler.

Example: In this interview question from Google and Netflix, you need to find the top actors based on their average movie rating within the genre in which they appear most frequently.

The solution using CTEs is as follows.

WITH genre_stats AS
  (SELECT actor_name,
          genre,
          COUNT(*) AS movie_count,
          AVG(movie_rating) AS avg_rating
   FROM top_actors_rating
   GROUP BY actor_name,
            genre),
            
max_genre_count AS
  (SELECT actor_name,
          MAX(movie_count) AS max_count
   FROM genre_stats
   GROUP BY actor_name),
     
top_genres AS
  (SELECT gs.*
   FROM genre_stats gs
   JOIN max_genre_count mgc ON gs.actor_name = mgc.actor_name
   AND gs.movie_count = mgc.max_count),
     
top_genre_avg AS
  (SELECT actor_name,
          MAX(avg_rating) AS max_avg_rating
   FROM top_genres
   GROUP BY actor_name),
   
filtered_top_genres AS
  (SELECT tg.*
   FROM top_genres tg
   JOIN top_genre_avg tga ON tg.actor_name = tga.actor_name
   AND tg.avg_rating = tga.max_avg_rating),
   
ranked_actors AS
  (SELECT *,
          DENSE_RANK() OVER (
                             ORDER BY avg_rating DESC) AS rank
   FROM filtered_top_genres),
   
final_selection AS
  (SELECT MAX(rank) AS max_rank
   FROM ranked_actors
   WHERE rank <= 3)
   
SELECT actor_name,
       genre,
       avg_rating
FROM ranked_actors
WHERE rank <=
    (SELECT max_rank
     FROM final_selection);

It is relatively complicated, but it still consists of seven clear CTEs, with the code's readability enhanced by clear aliases.

Curious what the same solution would look like using only subqueries? Here it is.

SELECT ra.actor_name,
       ra.genre,
       ra.avg_rating
FROM (
    SELECT *,
           DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
    FROM (
        SELECT tg.*
        FROM (
            SELECT gs.*
            FROM (
                SELECT actor_name,
                       genre,
                       COUNT(*) AS movie_count,
                       AVG(movie_rating) AS avg_rating
                FROM top_actors_rating
                GROUP BY actor_name, genre
            ) AS gs
            JOIN (
                SELECT actor_name,
                       MAX(movie_count) AS max_count
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS genre_stats
                GROUP BY actor_name
            ) AS mgc
            ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
        ) AS tg
        JOIN (
            SELECT actor_name,
                   MAX(avg_rating) AS max_avg_rating
            FROM (
                SELECT gs.*
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS gs
                JOIN (
                    SELECT actor_name,
                           MAX(movie_count) AS max_count
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS genre_stats
                    GROUP BY actor_name
                ) AS mgc
                ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
            ) AS top_genres
            GROUP BY actor_name
        ) AS tga
        ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
    ) AS filtered_top_genres
) AS ra
WHERE ra.rank <= (
    SELECT MAX(rank)
    FROM (
        SELECT *,
               DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
        FROM (
            SELECT tg.*
            FROM (
                SELECT gs.*
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS gs
                JOIN (
                    SELECT actor_name,
                           MAX(movie_count) AS max_count
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS genre_stats
                    GROUP BY actor_name
                ) AS mgc
                ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
            ) AS tg
            JOIN (
                SELECT actor_name,
                       MAX(avg_rating) AS max_avg_rating
                FROM (
                    SELECT gs.*
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS gs
                    JOIN (
                        SELECT actor_name,
                               MAX(movie_count) AS max_count
                        FROM (
                            SELECT actor_name,
                                   genre,
                                   COUNT(*) AS movie_count,
                                   AVG(movie_rating) AS avg_rating
                            FROM top_actors_rating
                            GROUP BY actor_name, genre
                        ) AS genre_stats
                        GROUP BY actor_name
                    ) AS mgc
                    ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
                ) AS top_genres
                GROUP BY actor_name
            ) AS tga
            ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
        ) AS filtered_top_genres
    ) AS ranked_actors
    WHERE rank <= 3
);

There is redundant logic repeated across the subqueries. How many subqueries is that? I have no idea. The code is impossible to maintain. Even though I just wrote it, I would still need half a day to understand it if I wanted to change something tomorrow. Additionally, the subquery aliases are completely meaningless.
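
The same contrast can be seen on a much smaller scale. The sketch below covers only the first step of the problem (each actor's most frequent genre) and runs a CTE version and a nested version through Python's sqlite3 module. The sample rows are invented, and this is a simplified illustration, not the full interview solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data for illustration only.
conn.executescript("""
CREATE TABLE top_actors_rating (actor_name TEXT, genre TEXT, movie_rating REAL);
INSERT INTO top_actors_rating VALUES
  ('Ann', 'Drama',  8.0),
  ('Ann', 'Drama',  9.0),
  ('Ann', 'Comedy', 7.0),
  ('Bob', 'Action', 6.0);
""")

# CTE version: genre_stats is written once and reused by name.
cte_sql = """
WITH genre_stats AS (
  SELECT actor_name, genre, COUNT(*) AS movie_count
  FROM top_actors_rating
  GROUP BY actor_name, genre
),
max_genre_count AS (
  SELECT actor_name, MAX(movie_count) AS max_count
  FROM genre_stats
  GROUP BY actor_name
)
SELECT gs.actor_name, gs.genre, gs.movie_count
FROM genre_stats gs
JOIN max_genre_count mgc
  ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
ORDER BY gs.actor_name, gs.genre;
"""

# Nested version: the same aggregation has to be spelled out twice.
nested_sql = """
SELECT gs.actor_name, gs.genre, gs.movie_count
FROM (
  SELECT actor_name, genre, COUNT(*) AS movie_count
  FROM top_actors_rating
  GROUP BY actor_name, genre
) AS gs
JOIN (
  SELECT actor_name, MAX(movie_count) AS max_count
  FROM (
    SELECT actor_name, genre, COUNT(*) AS movie_count
    FROM top_actors_rating
    GROUP BY actor_name, genre
  ) AS genre_stats
  GROUP BY actor_name
) AS mgc
  ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
ORDER BY gs.actor_name, gs.genre;
"""

cte_result = conn.execute(cte_sql).fetchall()
nested_result = conn.execute(nested_sql).fetchall()
print(cte_result)
```

Both queries return the same rows, but even at two levels the nested form already repeats the grouping logic, and every extra step multiplies that duplication.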

5. Handling NULLs in Logic

Why It's Hard: Candidates often think that NULL is equal to something. It is not. NULL is not equal to anything, not even itself. Logic involving NULLs behaves differently from logic involving actual values.

Common Mistake: Using = NULL instead of IS NULL in filtering, or missing output rows because NULLs break the condition logic.

Example: There is an interview question from IBM that asks you to calculate the total number of interactions and the total number of content items for each customer.

It does not sound too tricky, so you might write this solution with two CTEs, where one CTE counts the number of interactions per customer, and the other counts the number of content items. In the final SELECT, you FULL OUTER JOIN the two CTEs, and you have the solution. Right?

WITH interactions_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_interactions
   FROM customer_interactions
   GROUP BY customer_id),
   
content_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_content_items
   FROM user_content
   GROUP BY customer_id)
   
SELECT i.customer_id,
  i.total_interactions,
  c.total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;

Almost right. (By the way, in the output you will see double quotation marks ("") instead of NULL. That is how the StrataScratch UI displays it, but trust me, the engine still treats them as what they are: NULL values.)

Some rows in the output contain NULLs. This makes the output incorrect. A NULL value is neither a customer ID nor a number of interactions or content items, which is what the question explicitly asks you to show.

What the above solution is missing is COALESCE() to handle NULLs in the final SELECT. With it, all customers without interactions get their IDs from the content_summary CTE. Also, for customers with no interactions, no content, or neither, we now replace NULL with 0, which is a valid number.

WITH interactions_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_interactions
   FROM customer_interactions
   GROUP BY customer_id),
   
content_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_content_items
   FROM user_content
   GROUP BY customer_id)
   
SELECT COALESCE(i.customer_id, c.customer_id) AS customer_id,
       COALESCE(i.total_interactions, 0) AS total_interactions,
       COALESCE(c.total_content_items, 0) AS total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;
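
The NULL behavior described in this section is easy to verify directly. Here is a minimal sketch using Python's sqlite3 module; the customers table and its rows are hypothetical and exist only to show how = NULL, IS NULL, and COALESCE() behave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: two customers have no recorded interaction count.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, total_interactions INTEGER);
INSERT INTO customers VALUES (1, 5), (2, NULL), (3, NULL);
""")

# NULL = NULL evaluates to NULL (unknown), not true; Python sees it as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# "= NULL" never matches any row, because the comparison is never true.
rows_with_equals = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE total_interactions = NULL"
).fetchone()[0]

# "IS NULL" is the correct test for missing values.
rows_with_is_null = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE total_interactions IS NULL"
).fetchone()[0]

# COALESCE() replaces NULL with the first non-NULL argument, here 0.
filled = conn.execute("""
SELECT customer_id, COALESCE(total_interactions, 0) AS total_interactions
FROM customers
ORDER BY customer_id;
""").fetchall()

print(null_equals_null, rows_with_equals, rows_with_is_null, filled)
```

The `= NULL` filter silently returns zero rows instead of raising an error, which is why this mistake tends to go unnoticed.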

6. Group-Wise Deduplication

Why It's Hard: Group-wise deduplication means you are selecting one row per group. But you can't use GROUP BY unless you aggregate. On the other hand, you often need the full row, not the single value that aggregation and GROUP BY return.

Common Mistake: Using GROUP BY + LIMIT 1 (or DISTINCT ON, which is PostgreSQL-specific) instead of ROW_NUMBER() or RANK(), the latter if you want ties included.

Example: This question asks you to identify the best-selling item for each month, and there is no need to separate months by year. The best-selling item is calculated as unitprice * quantity.

The naive approach would be this. First, extract the sale month from invoicedate, select description, and find the total sales by summing unitprice * quantity. Then, to get the total sales by month and product description, simply GROUP BY those two columns. Finally, use ORDER BY to sort the output from the best-selling to the worst-selling product and use LIMIT 1 to output only the first row, i.e., the best-selling item.

SELECT DATE_PART('MONTH', invoicedate) AS sale_month,
       description,
       SUM(unitprice * quantity) AS total_paid
FROM online_retail
GROUP BY sale_month, description
ORDER BY total_paid DESC
LIMIT 1;

As I said, this is naive; the output somewhat resembles what we need, but we need this for every month, not just one.


One of the correct approaches is to use the RANK() window function. With this approach, we follow a method similar to the previous code. The difference is that the query now becomes a subquery in the FROM clause. In addition, we use RANK() to partition the data by month and then rank the rows within each partition (i.e., for each month separately) from the best-selling to the worst-selling item.

Then, in the main query, we simply select the required columns and output only the rows where the rank is 1, using the WHERE clause.

SELECT month,
       description,
       total_paid
FROM
  (SELECT DATE_PART('month', invoicedate) AS month,
          description,
          SUM(unitprice * quantity) AS total_paid,
          RANK() OVER (PARTITION BY DATE_PART('month', invoicedate) ORDER BY SUM(unitprice * quantity) DESC) AS rnk
   FROM online_retail
   GROUP BY month, description) AS tmp
WHERE rnk = 1;
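
To see how RANK() and ROW_NUMBER() differ when there is a tie, here is a sketch that runs the ranking approach through Python's sqlite3 module. The rows are invented, strftime('%m', ...) replaces PostgreSQL's DATE_PART(), and the aggregation is moved into a CTE (named monthly_sales here, a hypothetical name) to keep the SQLite version simple. It assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; February has two items tied at 40.
conn.executescript("""
CREATE TABLE online_retail (invoicedate TEXT, description TEXT,
                            unitprice INTEGER, quantity INTEGER);
INSERT INTO online_retail VALUES
  ('2021-01-05', 'Mug',    5, 10),
  ('2021-01-20', 'Lamp',  20,  3),
  ('2021-01-25', 'Mug',    5,  4),
  ('2021-02-03', 'Lamp',  20,  2),
  ('2021-02-10', 'Clock', 10,  4);
""")

rank_sql = """
WITH monthly_sales AS (
  SELECT strftime('%m', invoicedate) AS sale_month,
         description,
         SUM(unitprice * quantity) AS total_paid
  FROM online_retail
  GROUP BY sale_month, description
)
SELECT sale_month, description, total_paid
FROM (
  SELECT *,
         RANK() OVER (PARTITION BY sale_month ORDER BY total_paid DESC) AS rnk
  FROM monthly_sales
) AS ranked
WHERE rnk = 1
ORDER BY sale_month, description;
"""

# RANK() keeps every row tied for first place in a month.
best_sellers = conn.execute(rank_sql).fetchall()

# ROW_NUMBER() keeps exactly one row per month, picking arbitrarily among ties.
row_number_sql = rank_sql.replace("RANK()", "ROW_NUMBER()")
one_per_month = conn.execute(row_number_sql).fetchall()

print(best_sellers)
print(len(one_per_month))
```

With this data, RANK() returns three rows because February has a tie, while ROW_NUMBER() returns exactly one row per month. Which of the tied rows ROW_NUMBER() keeps is not defined unless you add a tiebreaker to its ORDER BY.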

Conclusion

The six concepts we covered appear frequently in SQL interview questions. Pay attention to them, practice interview questions that involve these concepts, learn the correct approaches, and you will improve your chances in your interviews.

Nate Rosidi is a data scientist who works in product strategy. He is also an educator and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the job market, gives interview advice, shares data science projects, and covers everything SQL.
