
Leveraging Pandas and SQL Together for Data Analysis

Image by Author | Canva

Pandas and SQL are both effective for data analysis, but what if we could combine their power? With pandasql, you can write SQL queries directly inside a Jupyter notebook. This integration lets us seamlessly blend SQL logic with Python for effective data analysis.

In this article, we will use pandas and SQL together on a data project from Uber. Let's get started!

# What Is pandasql?

Pandasql can be used with any DataFrame through an in-memory SQLite engine, so you can write pure SQL inside a Python environment.
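To make the in-memory SQLite idea concrete, the sketch below imitates it using only pandas and the standard-library `sqlite3` module. It illustrates the concept and is not pandasql's actual implementation; the helper name `run_sql` and the sample rows are made up for this example.

```python
import sqlite3
import pandas as pd


def run_sql(query, **frames):
    """Hypothetical helper: copy each DataFrame into an in-memory SQLite
    database under its keyword name, run the query, and return a DataFrame."""
    conn = sqlite3.connect(":memory:")
    try:
        for table_name, frame in frames.items():
            frame.to_sql(table_name, conn, index=False)
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


# Made-up sample rows, not the Uber dataset
demo = pd.DataFrame({"Name": ["Ana", "Ben", "Cy"], "Rating": [4.9, 4.5, 4.8]})

top_rated = run_sql("SELECT Name FROM df WHERE Rating >= 4.7", df=demo)
print(top_rated["Name"].tolist())
```

Because the database lives only in memory, nothing is written to disk and the original DataFrame is left unchanged.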

# Benefits of Using Pandas and SQL Together


SQL is useful for easily filtering rows, aggregating data, or applying multi-condition logic.
Python, on the other hand, offers advanced tools for statistical analysis and custom calculations, as well as set-based operations, which go beyond SQL's capabilities.
When used together, SQL simplifies data selection, while Python adds analytical flexibility.

# How to Run pandasql Inside a Jupyter Notebook?

To run pandasql inside a Jupyter notebook, start with the following code.

import pandas as pd
from pandasql import sqldf
run = lambda q: sqldf(q, globals())

Next, you can run your SQL code like this:

run("""
SELECT *
FROM df
LIMIT 10;
""")

Throughout this article, we will write SQL this way without redefining the run function each time.


Let's see how SQL and pandas work together on a real-life project from Uber.

# Real-World Project: Analyzing Uber Driver Performance Data

Image by Author

In this data project, Uber asks us to analyze driver performance data and evaluate bonus strategies.

// Data Exploration and Analysis

Now, let's explore the dataset. First, we will load the data.

// Initial Dataset Loading

Let's load the dataset using just pandas.

import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')

// Exploring the Data

Now let's take a first look at the dataset.

The output looks like this:

[Image: Data exploration output]

We now have an overview of the data.
As you can see, the dataset includes each driver's name, the number of trips completed, the acceptance rate (i.e., the percentage of trip requests accepted), the supply hours (total hours spent online), and the average rating.
Let's check the column names before starting the data analysis so we can use them correctly.

Here's the output.

[Image: Data exploration output]

As you can see, our dataset has five different columns, and there are no missing values.
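The column and missing-value checks described above can be reproduced with a few pandas calls. The sketch below is a minimal illustration on made-up rows: `df_demo` is a hypothetical stand-in that only borrows the five column names used in the queries of this article, and it assumes `Accept Rate` is stored as a number.

```python
import pandas as pd

# Made-up rows that reuse the article's five column names (not the real data)
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben"],
    "Trips Completed": [12, 7],
    "Accept Rate": [95, 80],      # assumption: numeric percentage points
    "Supply Hours": [9, 5],
    "Rating": [4.8, 4.6],
})

column_names = df_demo.columns.tolist()     # list of column labels
missing_per_column = df_demo.isna().sum()   # count of missing values per column

print(column_names)
print(missing_per_column.to_dict())
```

On the real dataset, the same two calls would be run on `df` instead of `df_demo`.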
Now let's answer the questions using SQL and Python.

# Question 1: Who Qualifies for Bonus Option 1?

In the first question, we are asked to determine the total bonus payout for Option 1, which is:

$50 for each driver who is online at least 8 hours, accepts 90% of requests, completes 10 trips, and has a rating of 4.7 or better.

// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)

In this step, we will start using pandasql.

In the following code, we select all drivers who meet the conditions for the Option 1 bonus, using the WHERE clause and the AND operator to link multiple conditions. To learn how to use WHERE and AND, refer to the documentation.

opt1_eligible = run("""
    SELECT Name                -- keep only a name column for clarity
    FROM   df
    WHERE  `Supply Hours`    >=  8
      AND  `Trips Completed` >= 10
      AND  `Accept Rate`     >= 90
      AND  Rating            >= 4.7;
""")
opt1_eligible

Here's the output.

[Image: Output showing the drivers eligible for Option 1]
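As a cross-check on the SQL filter, the same four conditions can be written as a pandas boolean mask. This is only a sketch on made-up rows (`df_demo` is hypothetical, and `Accept Rate` is assumed to be numeric); it is not part of the original solution.

```python
import pandas as pd

# Made-up rows; "Accept Rate" is assumed to hold numeric percentage points
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Dee"],
    "Trips Completed": [12, 10, 9, 15],
    "Accept Rate": [95, 90, 99, 85],
    "Supply Hours": [9, 8, 10, 12],
    "Rating": [4.8, 4.7, 4.9, 5.0],
})

# Each comparison yields a boolean Series; & combines them row by row
mask = (
    (df_demo["Supply Hours"] >= 8)
    & (df_demo["Trips Completed"] >= 10)
    & (df_demo["Accept Rate"] >= 90)
    & (df_demo["Rating"] >= 4.7)
)
eligible_names = df_demo.loc[mask, "Name"].tolist()
print(eligible_names)
```

If the mask and the SQL query disagree on the real data, a likely culprit is a column stored as text (for example a percentage with a `%` sign), which the two approaches compare differently.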

// Step 2: Finishing in Pandas

After filtering the dataset using SQL with pandasql, we switch to pandas to perform the numerical calculations and finalize the analysis. This hybrid approach, combining SQL and Python, improves both readability and flexibility.

Next, using the following Python code, we calculate the total payout by multiplying the number of qualified drivers (using len()) by the $50 bonus per driver. Check out the documentation to see how to use the len() function.

payout_opt1 = 50 * len(opt1_eligible)
print(f"Option 1 payout: ${payout_opt1:,}")

Here's the output.

[Image: Finishing in pandas]

# Question 2: Calculating the Total Payout for Bonus Option 2

In the second question, we are asked to find the total bonus payout using Option 2:

$4/trip for all drivers who complete 12 trips and have a rating of 4.7 or better.

// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)

First, we use SQL to filter for drivers who meet the Option 2 criteria: completing at least 12 trips and maintaining a rating of 4.7 or higher.

# Grab only the rows that satisfy the Option-2 thresholds
opt2_drivers = run("""
    SELECT Name,
           `Trips Completed`
    FROM   df
    WHERE  `Trips Completed` >= 12
      AND  Rating            >= 4.7;
""")
opt2_drivers.head()

Here's what we get.

[Image: Filtering the qualifying drivers with SQL (pandasql)]

// Step 2: Finishing the Calculation in Pandas

Now let's perform the calculation using pandas. The code computes the total bonus by summing the Trips Completed column with sum() and then multiplying the result by the $4 bonus per trip.

total_trips   = opt2_drivers["Trips Completed"].sum()
option2_bonus = 4 * total_trips
print(f"Total trips: {total_trips},  Option-2 payout: ${option2_bonus}")

Here is the result.

[Image: Finishing the calculation in pandas]
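The split between SQL and pandas here is a design choice, since the sum could also be pushed into the query itself. The sketch below shows that alternative on made-up rows, using the standard-library `sqlite3` module in place of pandasql so that it runs without extra dependencies; `df_demo` and its values are hypothetical.

```python
import sqlite3
import pandas as pd

# Made-up rows, not the Uber dataset
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Trips Completed": [12, 20, 11],
    "Rating": [4.8, 4.7, 4.9],
})

conn = sqlite3.connect(":memory:")
df_demo.to_sql("df", conn, index=False)

# Filter and aggregate in one query: SUM() runs only over the qualifying rows
result = pd.read_sql_query(
    """
    SELECT SUM(`Trips Completed`)     AS total_trips,
           4 * SUM(`Trips Completed`) AS option2_bonus
    FROM   df
    WHERE  `Trips Completed` >= 12
      AND  Rating            >= 4.7;
    """,
    conn,
)
conn.close()
print(result)
```

Keeping the aggregation in pandas, as the article does, leaves the filtered rows available for inspection, while the single-query version returns only the totals.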

# Question 3: Identifying Drivers Who Qualify for Option 1 But Not Option 2

In the third question, we are asked to count the number of drivers who qualify for Option 1 but not for Option 2.

// Step 1: Building Two Eligibility Tables with SQL (pandasql)

In the following SQL code, we create two datasets: one for drivers who meet the Option 1 criteria and one for those who meet the Option 2 criteria.

# All Option-1 drivers
opt1_drivers = run("""
    SELECT Name
    FROM   df
    WHERE  `Supply Hours`    >=  8
      AND  `Trips Completed` >= 10
      AND  `Accept Rate`     >= 90
      AND  Rating            >= 4.7;
""")

# All Option-2 drivers
opt2_drivers = run("""
    SELECT Name
    FROM   df
    WHERE  `Trips Completed` >= 12
      AND  Rating            >= 4.7;
""")

// Step 2: Using Python Set Logic to Find the Difference

Next, we will use Python to identify the drivers who appear in Option 1 but not in Option 2, using set operations.

Here's the code:

only_opt1 = set(opt1_drivers["Name"]) - set(opt2_drivers["Name"])
count_only_opt1 = len(only_opt1)

print(f"Drivers qualifying for Option 1 but not Option 2: {count_only_opt1}")

Here's the output.

[Image: Using Python set logic to find the difference]

By combining these methods, we use SQL for filtering and Python's set logic for comparing the resulting datasets.
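The set difference can also be expressed directly in pandas with `isin`, which keeps the result as a DataFrame. This is an alternative sketch on made-up names, not the approach used in the article; `opt1_demo` and `opt2_demo` are hypothetical stand-ins for the two query results.

```python
import pandas as pd

# Made-up stand-ins for the two eligibility tables
opt1_demo = pd.DataFrame({"Name": ["Ana", "Ben", "Cy"]})
opt2_demo = pd.DataFrame({"Name": ["Ben", "Dee"]})

# Keep the Option 1 rows whose name does NOT appear in the Option 2 table
only_opt1_demo = opt1_demo[~opt1_demo["Name"].isin(opt2_demo["Name"])]
print(only_opt1_demo["Name"].tolist())
```

Unlike a Python set, this version preserves row order and any duplicate names, which matters if two drivers share a name.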

# Question 4: Finding Low-Performing Drivers with High Ratings

In question 4, we are asked to determine the percentage of drivers who completed fewer than 10 trips, had an acceptance rate below 90%, and still maintained a rating of 4.7 or higher.

// Step 1: Pulling the Subset with SQL (pandasql)

In the following code, we select all drivers who have completed fewer than 10 trips, have an acceptance rate of less than 90%, and hold a rating of at least 4.7.

low_kpi_df = run("""
    SELECT *
    FROM   df
    WHERE  `Trips Completed` < 10
      AND  `Accept Rate`     < 90
      AND  Rating            >= 4.7;
""")
low_kpi_df

Here's the output.

[Image: Pulling the subset with SQL (pandasql)]

// Step 2: Calculating the Percentage in Plain Pandas

In this step, we will use Python to calculate the percentage of such drivers.

We simply divide the number of filtered drivers by the total driver count, then multiply by 100 to get the percentage.

Here's the code:

num_low_kpi   = len(low_kpi_df)
total_drivers = len(df)
percentage    = round(100 * num_low_kpi / total_drivers, 2)

print(f"{num_low_kpi} out of {total_drivers} drivers ⇒ {percentage}%")

Here's the output.

[Image: Calculating the percentage in plain pandas]
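The same percentage can be obtained in one step from a boolean mask, because the mean of a boolean Series is the share of `True` values. The sketch below uses made-up rows (`df_demo` is hypothetical, with `Accept Rate` assumed numeric) and is meant only as a sanity check on the two-step calculation.

```python
import pandas as pd

# Made-up rows, not the Uber dataset
df_demo = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy", "Dee"],
    "Trips Completed": [8, 12, 5, 9],
    "Accept Rate": [85, 95, 80, 70],
    "Rating": [4.8, 4.9, 4.5, 4.7],
})

low_kpi_mask = (
    (df_demo["Trips Completed"] < 10)
    & (df_demo["Accept Rate"] < 90)
    & (df_demo["Rating"] >= 4.7)
)

# mean() of a boolean Series equals (number of True rows) / (number of rows)
percentage_demo = round(100 * low_kpi_mask.mean(), 2)
print(f"{percentage_demo}%")
```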

# Question 5: Calculating Annual Profit Without Partnering with Uber

In the fifth question, we need to calculate the annual income of a taxi driver who does not partner with Uber, based on the given expense and revenue parameters.

// Step 1: Pulling Yearly Revenue and Expenses with SQL (pandasql)

Using SQL, we first calculate the yearly revenue from the daily fares and the yearly expenses for gas, rent, and insurance.

taxi_stats = run("""
SELECT
    200*6*(52-3)                      AS annual_revenue,
    ((200+500)*(52-3) + 400*12)       AS annual_expenses
""")
taxi_stats

Here's the output.

[Image: Pulling yearly revenue and expenses with SQL (pandasql)]

// Step 2: Deriving Profit and Margin with Pandas

In the next step, we will use Python to compute the profit and margin the driver earns when not partnering with Uber.

rev  = taxi_stats.loc[0, "annual_revenue"]
cost = taxi_stats.loc[0, "annual_expenses"]

profit  = rev - cost
margin  = round(100 * profit / rev, 2)

print(f"Revenue  : ${rev:,}")
print(f"Expenses : ${cost:,}")
print(f"Profit   : ${profit:,}    (margin: {margin}%)")

Here's what we get.

[Image: Pandas computes profit and margin from the SQL numbers]
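To make the arithmetic behind the query easier to audit, the sketch below repeats it in plain Python with named constants. The constant names and the reading of each number (daily fares, six working days a week, three weeks off, weekly gas and rent, monthly insurance) are assumptions inferred from the expressions in the SQL above, not wording taken from the original task.

```python
# Assumed interpretation of the numbers used in the SQL expressions
DAILY_FARES = 200            # dollars earned per working day
DAYS_PER_WEEK = 6            # working days per week
WEEKS_WORKED = 52 - 3        # weeks in a year minus weeks off
WEEKLY_GAS = 200             # dollars per week
WEEKLY_RENT = 500            # dollars per week
MONTHLY_INSURANCE = 400      # dollars per month

annual_revenue = DAILY_FARES * DAYS_PER_WEEK * WEEKS_WORKED
annual_expenses = (WEEKLY_GAS + WEEKLY_RENT) * WEEKS_WORKED + MONTHLY_INSURANCE * 12
annual_profit = annual_revenue - annual_expenses
margin_pct = round(100 * annual_profit / annual_revenue, 2)

print(f"Revenue  : ${annual_revenue:,}")
print(f"Expenses : ${annual_expenses:,}")
print(f"Profit   : ${annual_profit:,}    (margin: {margin_pct}%)")
```

These totals match the `old_rev` and `old_profit` values reused in Question 6.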

# Question 6: Calculating the Fare Increase Needed to Maintain Profitability

In the sixth question, we assume that the same driver decides to buy a Town Car and partner with Uber.

The gas expenses increase by 5%, insurance decreases by 20%, and rental costs are eliminated, but the driver needs to cover the $40,000 cost of the car. We are asked to calculate how much this driver's weekly gross fares must increase in the first year to both pay off the car and maintain the same annual profit margin.

// Step 1: Building the New One-Year Expense Stack with SQL

In this step, we will use SQL to calculate the new one-year expenses with adjusted gas and insurance, no rental fees, and the cost of the car.

new_exp = run("""
SELECT
    40000             AS car,
    200*1.05*(52-3)   AS gas,        -- +5 %
    400*0.80*12       AS insurance   -- –20 %
""")
new_cost = new_exp.sum(axis=1).iloc[0]
new_cost

Here's the output.

[Image: SQL builds the new one-year expense stack]

// Step 2: Calculating the Weekly Fare Increase with Pandas

Next, we use Python to calculate how much more the driver must earn per week to preserve that margin after buying the car.

# Existing values from Question 5
old_rev    = 58800
old_profit = 19700
old_margin = old_profit / old_rev
weeks      = 49

# new_cost was calculated in the previous step (54130.0)

# We need to find the new revenue (new_rev) such that the profit margin remains the same:
# (new_rev - new_cost) / new_rev = old_margin
# Solving for new_rev gives: new_rev = new_cost / (1 - old_margin)
new_rev_required = new_cost / (1 - old_margin)

# The total increase in annual revenue needed is the difference
total_increase = new_rev_required - old_rev

# Divide by the number of working weeks to get the required weekly increase
weekly_bump = round(total_increase / weeks, 2)

print(f"Required weekly gross-fare increase = ${weekly_bump}")

Here's what we get.

[Image: Pandas uses the old profit margin and algebra to get the weekly bump]
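One way to gain confidence in the algebra is to plug the solved revenue back into the margin formula and confirm that the margin is unchanged. The sketch below does that in plain Python, reusing the figures from Questions 5 and 6; the variable names are illustrative.

```python
# Figures from Question 5
old_rev = 58_800
old_cost = 39_100
old_margin = (old_rev - old_cost) / old_rev

# New one-year cost stack from Step 1: car + gas (+5%) + insurance (-20%)
new_cost = 40_000 + 200 * 1.05 * 49 + 400 * 0.80 * 12

# Solve (new_rev - new_cost) / new_rev = old_margin for new_rev
new_rev = new_cost / (1 - old_margin)

# Plug the result back in: the new margin should equal the old one
new_margin = (new_rev - new_cost) / new_rev
weekly_bump = (new_rev - old_rev) / 49

print(f"Old margin: {old_margin:.4f}, new margin: {new_margin:.4f}")
print(f"Required weekly gross-fare increase = ${weekly_bump:.2f}")
```

If the two margins printed here differ, the rearranged formula or one of the cost inputs has been mistyped.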

# Conclusion

By combining the strengths of SQL and Python, primarily through pandasql, we solved six different problems.

SQL helps with filtering and summarizing structured data, while Python is well suited to advanced computation and flexible data manipulation.

Throughout this analysis, we used both tools to streamline the workflow and make each step easier to follow.

Nate Rosidi is a data scientist who also works in product strategy. He is also an educator and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the job market, gives interview advice, shares data science projects, and covers everything SQL.
