
7 DuckDB SQL Queries That Save You Hours of Pandas Work


Image by author | Canva

Pandas is one of the fastest-growing Python libraries, with a huge community around it. That popularity has opened the door to alternatives, such as Polars. In this article, we will examine another such alternative: DuckDB.

DuckDB is a SQL database that can run directly inside your notebook. It requires no setup and no server. It is easy to install and works hand in hand with pandas.

Unlike other SQL databases, you do not need to configure a server. Once installed, it simply works with your notebook. That means there is no local-setup overhead: you can start writing code right away. DuckDB handles filtering, joining, and aggregation with clean SQL syntax, and compared to pandas it shines on large datasets.

Enough with the introductions; let's get started!

Data Project – Uber Partner Business Modeling

We will use Jupyter Notebook with Python to do the data analysis. To make things more fun, we will work on a real data project.


Here is the link to the data project we will use in this article. It is a data project from Uber called Partner Business Modeling.

Uber has used this data project in its hiring process for data science positions, and candidates are asked to analyze two different scenarios.

  • Scenario 1: Compare the cost of two bonus programs designed to get more drivers online on a busy day.
  • Scenario 2: Calculate and compare the yearly net income of a traditional taxi driver vs. a driver who partners with Uber and buys a car.

Loading the Data

Let's load the dataframe first. This step matters because we will register this dataframe with DuckDB in the following sections.

import pandas as pd
df = pd.read_csv("dataset_2.csv")

To check the dataset, let's look at the first few rows:


Let's look at all columns.

Here's the output.
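Neither check needs any special setup. As a quick sketch (the rows below are invented stand-ins; the column names are taken from the queries used later in this article):

```python
import pandas as pd

# Hypothetical stand-in for dataset_2.csv; the column names mirror the
# queries used later in the article, but the rows themselves are invented
df = pd.DataFrame({
    "Name": ["Driver A", "Driver B", "Driver C"],
    "Trips Completed": [2, 12, 9],
    "Accept Rate": ["100%", "83%", "95%"],
    "Supply Hours": [3, 10, 8],
    "Rating": [4.8, 4.7, 4.6],
})

print(df.head())          # preview the first rows
print(list(df.columns))   # list every column name
```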

Connecting to DuckDB and Registering the DataFrame

Good, the dataframe is ready, but how do we connect DuckDB to it?
First, if you have not installed it yet, install DuckDB (pip install duckdb).

Connecting to DuckDB is easy. If you want to read the documentation, check it here.

Now, here is the code to create a connection and register the dataframe.

import duckdb
con = duckdb.connect()

con.register("data", df)


Good, let's start going through the seven queries that will save you hours of pandas work!

1. Multi-Criteria Filtering

One of the most important benefits of SQL is how naturally it handles filtering, especially when many conditions have to be applied at once.

Implementing multi-criteria filters in DuckDB vs pandas

DuckDB lets you express multiple filters with a SQL WHERE clause and AND conditions, which stays readable as the number of filters grows.

SELECT 
    *
FROM data
WHERE condition_1
  AND condition_2
  AND condition_3
  AND condition_4

Now let's see how we write the same logic in pandas. In pandas, the same logic is expressed with chained boolean masks, which can get verbose when there are many conditions.

filtered_df = df[
    (df["condition_1"]) &
    (df["condition_2"]) &
    (df["condition_3"]) &
    (df["condition_4"])
]

Both methods are equally readable and work for basic use, but DuckDB feels more natural and stays cleaner as the logic becomes more complicated.

Multi-criteria filtering for the Uber data project

In this case, we want to find the drivers who qualify for a specific bonus under Option 1.

According to the rules, drivers must:

  • Be online for at least 8 hours
  • Complete at least 10 trips
  • Accept at least 90% of ride requests
  • Have a rating of 4.7 or above

Now all we have to do is write a query that applies all of these filters. Here is the code.

SELECT 
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7

But to run this from Python, we need to wrap the query in con.execute("""...""").fetchdf():

con.execute("""
SELECT 
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
""").fetchdf()

We will do this for every query in this article. Now that you know how to run it in a Jupyter Notebook, we will show only the SQL code from now on; you know how to convert it to the Pythonic version.
Good. Now, remember that the data project asks us to calculate the total payout of Option 1.


We count the qualifying drivers, and then multiply that count by $50, because the payout is $50 per driver; that is the COUNT(*) * 50 in the query.

Here's the output.
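For comparison, the same Option 1 calculation in pandas would look roughly like this. The rows are invented, and the "Accept Rate" strings ending in % mirror how the dataset stores that column:

```python
import pandas as pd

# Invented sample rows mirroring the dataset's columns
df = pd.DataFrame({
    "Supply Hours": [9, 10, 3],
    "Accept Rate": ["95%", "88%", "100%"],
    "Trips Completed": [12, 15, 2],
    "Rating": [4.8, 4.9, 4.6],
})

qualified = df[
    (df["Supply Hours"] >= 8)
    & (df["Accept Rate"].str.rstrip("%").astype(float) >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
]

total_payout = len(qualified) * 50  # $50 per qualified driver
print(len(qualified), total_payout)
```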

2. Fast Aggregation to Estimate Business Incentives

SQL is built for fast aggregation, especially when you need to summarize data across all rows.

Implementing aggregation in DuckDB vs pandas

SELECT 
    COUNT(*) AS num_rows,
    SUM(column_name) AS total_value
FROM data
WHERE some_condition

DuckDB lets you aggregate values across all rows using SQL functions like SUM and COUNT in a single block.

filtered = df[df["some_condition"]]
num_rows = filtered.shape[0]
total_value = filtered["column_name"].sum()

In pandas, you first need to filter the DataFrame, then count and sum it with separate method calls.

DuckDB is shorter, easier to read, and needs no intermediate variables.

Aggregation for the Uber data project

Well, let's move on to the second bonus program, Option 2. According to the project description, drivers will receive $4 per trip if they:

  • Complete at least 12 trips
  • Have a rating of 4.7 or better

SELECT 
    COUNT(*) AS qualified_drivers,
    SUM("Trips Completed") * 4 AS total_payout
FROM data
WHERE "Trips Completed" >= 12
  AND Rating >= 4.7

In this case, instead of counting the drivers, we need to sum the trips they completed, since the bonus is paid per trip, not per driver.

COUNT here tells us how many drivers qualify. To calculate the full payout, however, we sum their completed trips and multiply by $4, as Option 2 requires.

Here's the output.

With DuckDB, we do not need to loop over rows or build custom aggregations; the SUM function takes care of everything we need.
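A pandas sketch of the same Option 2 payout, on invented sample rows, would be:

```python
import pandas as pd

# Invented sample rows for illustration
df = pd.DataFrame({
    "Trips Completed": [14, 12, 8],
    "Rating": [4.8, 4.6, 4.9],
})

eligible = df[(df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)]
total_payout = eligible["Trips Completed"].sum() * 4  # $4 per completed trip
print(len(eligible), total_payout)
```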

3. Finding Extremes and Differences Using Boolean Logic

In SQL, you can easily combine conditions with boolean logic such as AND, OR, and NOT.

Using boolean logic in DuckDB vs pandas

SELECT *
FROM data
WHERE condition_a
  AND condition_b
  AND NOT (condition_c)

DuckDB supports boolean logic natively in the WHERE clause with AND, OR, and NOT.

filtered = df[
    (df["condition_a"]) &
    (df["condition_b"]) &
    ~(df["condition_c"])
]

Pandas requires combining masks with operators and parentheses, including the easy-to-miss ~ for negation.

While both work, DuckDB is easier to reason about when the logic includes negations or nested conditions.

Boolean logic for the Uber data project

Now that we have calculated Option 1 and Option 2, what's next? It is time to compare them: how many drivers qualify for Option 1 but not for Option 2?


SELECT COUNT(*) AS only_option1
FROM data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
  AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)

This is where boolean logic comes in: we combine AND with NOT.

Here's the output.

Let's break it down:

  • The first four conditions are the Option 1 criteria.
  • The NOT (...) part removes the drivers who also qualify for Option 2.

Pretty straightforward, right?
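The pandas version of the same "Option 1 but not Option 2" logic, sketched on invented rows, shows why the ~ mask gets noisy:

```python
import pandas as pd

# Invented sample rows for illustration
df = pd.DataFrame({
    "Supply Hours": [9, 10],
    "Accept Rate": ["95%", "92%"],
    "Trips Completed": [10, 13],
    "Rating": [4.8, 4.9],
})

accept = df["Accept Rate"].str.rstrip("%").astype(float)
option1 = (
    (df["Supply Hours"] >= 8)
    & (accept >= 90)
    & (df["Trips Completed"] >= 10)
    & (df["Rating"] >= 4.7)
)
option2 = (df["Trips Completed"] >= 12) & (df["Rating"] >= 4.7)

only_option1 = df[option1 & ~option2]  # qualifies for 1, not for 2
print(len(only_option1))
```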

4. Quick Cohort Sizing with Conditional Filters

Sometimes, you want to understand how large a group, or cohort, is within your data.

Using conditional filters in DuckDB vs pandas

SELECT 
  ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE condition_1
  AND condition_2
  AND condition_3

DuckDB combines cohort filtering and the percentage calculation in one SQL query, subquery included.

filtered = df[
    (df["condition_1"]) &
    (df["condition_2"]) &
    (df["condition_3"])
]
percentage = round(100.0 * len(filtered) / len(df), 2)

Pandas needs separate filtering, counting, and a manual division to compute the percentage.

DuckDB here is cleaner and faster: it reduces the number of steps and avoids repetitive code.

Cohort sizing for the Uber data project

We are now at the last question of Scenario 1. The project asks what percentage of drivers:

  • Completed fewer than 10 trips
  • Had an acceptance rate lower than 90%
  • Had a rating of 4.7 or higher

SELECT 
  ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE "Trips Completed" < 10
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) < 90
  AND Rating >= 4.7

These are three different filters combined, and we want the percentage of drivers who satisfy all of them. Let's look at the query.

Here's the output.

Here, we count the rows that satisfy all three conditions and divide by the total number of drivers to get the percentage.
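The same cohort percentage in pandas, sketched on invented rows, takes three separate steps:

```python
import pandas as pd

# Invented sample rows for illustration
df = pd.DataFrame({
    "Trips Completed": [5, 20, 8, 15],
    "Accept Rate": ["80%", "95%", "85%", "99%"],
    "Rating": [4.8, 4.9, 4.2, 4.7],
})

accept = df["Accept Rate"].str.rstrip("%").astype(float)
cohort = df[(df["Trips Completed"] < 10) & (accept < 90) & (df["Rating"] >= 4.7)]
percentage = round(100.0 * len(cohort) / len(df), 2)
print(percentage)
```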

5. Basic Math Queries to Model Revenue

Now, imagine you want some basic math. You can write arithmetic expressions directly inside your SELECT statement.

Implementing arithmetic in DuckDB vs pandas

SELECT 
    daily_income * work_days * weeks_per_year AS annual_revenue,
    weekly_cost * weeks_per_year AS total_cost,
    (daily_income * work_days * weeks_per_year) - (weekly_cost * weeks_per_year) AS net_income
FROM data

DuckDB lets you write arithmetic directly in the SELECT clause, like a calculator.

daily_income = 200
weeks_per_year = 49
work_days = 6
weekly_cost = 500

annual_revenue = daily_income * work_days * weeks_per_year
total_cost = weekly_cost * weeks_per_year
net_income = annual_revenue - total_cost

Pandas needs an intermediate variable for each step of the same calculation.

DuckDB condenses the math into one readable SQL block, while pandas is a bit more verbose with its sequential assignments.

Basic math in the Uber data project

In Scenario 2, Uber asks how much money (after expenses) a driver makes per year without partnering with Uber. There are expenses such as gas, rent, and insurance.


SELECT 
    200 * 6 * (52 - 3) AS annual_revenue,
    200 * (52 - 3) AS gas_expense,
    500 * (52 - 3) AS rent_expense,
    400 * 12 AS insurance_expense,
    (200 * 6 * (52 - 3)) 
      - (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income

The query computes the annual revenue and subtracts the expenses from it.

Here's the output.

With DuckDB, you can write all of this as one SQL math block. You don't need pandas DataFrames or manual looping!
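You can double-check the SQL arithmetic with plain Python, using the scenario's figures ($200 fare per day, 6 days a week, 49 working weeks, $200/week gas, $500/week rent, $400/month insurance):

```python
weeks = 52 - 3                      # 49 working weeks
annual_revenue = 200 * 6 * weeks    # $200/day, 6 days/week
gas = 200 * weeks
rent = 500 * weeks
insurance = 400 * 12
net_income = annual_revenue - (gas + rent + insurance)
print(annual_revenue, net_income)
```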

6. Conditional Math

What if your expense structure changes based on certain conditions?

Conditional math in DuckDB vs pandas

SELECT 
    original_cost * 1.05 AS increased_cost,
    original_cost * 0.8 AS discounted_cost,
    0 AS removed_cost,
    (original_cost * 1.05 + original_cost * 0.8) AS total_new_cost

DuckDB lets you express conditional logic through arithmetic adjustments inside your query.

weeks_worked = 49
gas = 200
insurance = 400

gas_expense = gas * 1.05 * weeks_worked
insurance_expense = insurance * 0.8 * 12
rent_expense = 0
total = gas_expense + insurance_expense

Pandas expresses the same idea with several arithmetic lines and variable updates.

DuckDB compresses into one SQL statement what would take several assignments in pandas.

Conditional math in the Uber data project

In this case, we model what happens when a driver who partners with Uber also buys a car. The costs change as follows:

  • Gas costs rise by 5%
  • Insurance decreases by 20%

con.execute("""
SELECT 
    200 * 1.05 * 49 AS gas_expense,
    400 * 0.8 * 12 AS insurance_expense,
    0 AS rent_expense,
    (200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense
""").fetchdf()

And there is no more rent expense, since the driver now owns the car (hence the 0 AS rent_expense).

Here's the output.
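Running the same adjusted arithmetic in plain Python confirms the totals:

```python
weeks_worked = 49
gas_expense = 200 * 1.05 * weeks_worked   # gas up 5%
insurance_expense = 400 * 0.8 * 12        # insurance down 20%
rent_expense = 0                          # the driver owns the car now
total_expense = gas_expense + insurance_expense + rent_expense
print(gas_expense, insurance_expense, total_expense)
```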

7. Goal-Driven Math for Income Targeting

Sometimes, your analysis is driven by a business goal, such as hitting an income target or covering a large cost.

Implementing goal-driven math in DuckDB vs pandas

WITH vars AS (
  SELECT base_income, cost_1, cost_2, target_item
),
calc AS (
  SELECT 
    base_income - (cost_1 + cost_2) AS current_profit,
    cost_1 * 1.1 + cost_2 * 0.8 + target_item AS new_total_expense
  FROM vars
),
final AS (
  SELECT 
    current_profit + new_total_expense AS required_revenue,
    required_revenue / 49 AS required_weekly_income
  FROM calc
)
SELECT required_weekly_income FROM final

DuckDB chains the logic with CTEs, which keeps the query modular and easy to follow.

weeks = 49
original_income = 200 * 6 * weeks
original_cost = (200 + 500) * weeks + 400 * 12
net_income = original_income - original_cost

# new expenses + car cost
new_gas = 200 * 1.05 * weeks
new_insurance = 400 * 0.8 * 12
car_cost = 40000

required_revenue = net_income + new_gas + new_insurance + car_cost
required_weekly_income = required_revenue / weeks

Pandas requires a sequence of calculations that reuse earlier variables to avoid repetition.

DuckDB lets you build the logical pipeline step by step without cluttering your notebook with code.

Goal-driven math in the Uber data project

Now that we have modeled the new expenses, let's answer the final business question: how much does the driver need to earn per week to:

  • Pay off a $40,000 car within a year
  • Maintain the same yearly net income

WITH vars AS (
  SELECT 
    52 AS total_weeks_per_year,
    3 AS weeks_off,
    6 AS days_per_week,
    200 AS fare_per_day,
    400 AS monthly_insurance,
    200 AS gas_per_week,
    500 AS vehicle_rent,
    40000 AS car_cost
),
base AS (
  SELECT 
    total_weeks_per_year,
    weeks_off,
    days_per_week,
    fare_per_day,
    monthly_insurance,
    gas_per_week,
    vehicle_rent,
    car_cost,
    total_weeks_per_year - weeks_off AS weeks_worked,
    (fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue,
    (gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas,
    (vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent,
    (monthly_insurance * 12) AS original_insurance
  FROM vars
),
compare AS (
  SELECT *,
    (original_gas + original_rent + original_insurance) AS original_total_expense,
    (original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income
  FROM base
),
new_costs AS (
  SELECT *,
    gas_per_week * 1.05 * weeks_worked AS new_gas,
    monthly_insurance * 0.8 * 12 AS new_insurance
  FROM compare
),
final AS (
  SELECT *,
    new_gas + new_insurance + car_cost AS new_total_expense,
    original_net_income + new_gas + new_insurance + car_cost AS required_revenue,
    required_revenue / weeks_worked AS required_weekly_revenue,
    original_annual_revenue / weeks_worked AS original_weekly_revenue
  FROM new_costs
)
SELECT 
  ROUND(required_weekly_revenue, 2) AS required_weekly_revenue,
  ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift
FROM final

The query above models the whole scenario step by step.

Here's the output.
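As a sanity check, the same chain of calculations in plain Python lands on the same numbers:

```python
weeks_worked = 52 - 3                                   # 49 working weeks
original_net = 200 * 6 * weeks_worked - ((200 + 500) * weeks_worked + 400 * 12)
new_gas = 200 * 1.05 * weeks_worked                     # gas up 5%
new_insurance = 400 * 0.8 * 12                          # insurance down 20%
required_revenue = original_net + new_gas + new_insurance + 40000  # plus the car
required_weekly = round(required_revenue / weeks_worked, 2)
weekly_uplift = round(required_weekly - 200 * 6, 2)     # vs. the old weekly revenue
print(required_weekly, weekly_uplift)
```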

Final Thoughts

In this article, we explored how to connect to DuckDB and analyze data with it. Instead of long chains of pandas operations, we used SQL queries, and we did it on a real data project used in a data science hiring process.

For data scientists working in analytics-heavy roles, DuckDB is a lightweight but powerful alternative to pandas. Try it on your next project, especially when SQL logic fits the problem better.

Nate Rosid

He is a data scientist and works in product strategy. He also teaches analytics, and is the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
