Leveraging Pandas and SQL Together for Efficient Data Analysis

Image by Author | Canva
Pandas and SQL are both effective for data analysis, but what if we could combine their power? With pandasql, you can write SQL queries directly inside a Jupyter notebook. This integration lets us blend SQL logic with Python for effective data analysis.
In this article, we will use pandas and SQL together on a data project from Uber. Let's get started!
# What is pandasql?
pandasql can query any pandas DataFrame through an in-memory SQLite engine, so you can write pure SQL inside a Python environment.
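To make the mechanism concrete, here is a minimal sketch of the same idea using only pandas and Python's built-in sqlite3 module: copy a DataFrame into an in-memory SQLite database, run a query, and read the result back as a DataFrame. It illustrates the approach and is not pandasql's actual implementation; the helper name `query_frames` and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd


def query_frames(sql, frames):
    """Run `sql` against DataFrames loaded into a throwaway in-memory SQLite database.

    `frames` maps table names to DataFrames. Hypothetical helper for illustration.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for table_name, frame in frames.items():
            # Each DataFrame becomes a SQLite table with the same column names.
            frame.to_sql(table_name, conn, index=False)
        return pd.read_sql_query(sql, conn)
    finally:
        conn.close()


# Made-up rows, only to show the round trip from DataFrame to SQL and back.
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Rating": [4.9, 4.5, 4.7]})
result = query_frames(
    "SELECT Name FROM demo WHERE Rating >= 4.7 ORDER BY Name;",
    {"demo": demo},
)
print(result["Name"].tolist())
```

Because the database lives only in memory, nothing is written to disk and the original DataFrame is left unchanged.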
# Benefits of Using Pandas and SQL Together

SQL is useful for easily filtering rows, aggregating data, or applying multi-condition logic.
Python, on the other hand, offers advanced tools for statistical analysis and custom calculations, as well as set-based operations, which go beyond what SQL can do.
When used together, SQL simplifies data selection, while Python adds analytical flexibility.
# How to Run pandasql Inside a Jupyter Notebook
To run pandasql inside a Jupyter notebook, start with the following code.
import pandas as pd
from pandasql import sqldf
run = lambda q: sqldf(q, globals())
Next, you can run your SQL code like this:
run("""
SELECT *
FROM df
LIMIT 10;
""")
We will use the run function throughout this article without showing its definition each time.

Let's see how SQL and pandas work together on a real project from Uber.
# Real-World Project: Analyzing Uber Driver Performance Data


Image by Author
In this data project, Uber asks us to analyze driver performance data and evaluate bonus strategies.
// Data Exploration and Analysis
Now, let's explore the dataset. First, we will load the data.
// Initial Dataset Loading
Let's load the dataset using just pandas.
import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')
// Exploring the Data
Now let's review the dataset.
The output looks like this:

We now have a first look at the data.
As you can see, the dataset includes each driver's name, the number of trips completed, the acceptance rate (i.e., the percentage of trip requests accepted), the supply hours (total hours spent online), and the average rating.
Let's verify the column names before starting the analysis so we can reference them correctly.
Here's the output.

As you can see, our dataset has five different columns, and there are no missing values.
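A check along these lines can be done with standard pandas calls. The sketch below uses a small made-up frame in place of dataset_2.csv. The column names are the ones referenced by the queries in this article; the values, and the assumption that Accept Rate is stored as a number (implied by the `>= 90` comparison), are for illustration only.

```python
import pandas as pd

# Stand-in for df = pd.read_csv('dataset_2.csv'); all values are invented.
sample = pd.DataFrame(
    {
        "Name": ["Driver A", "Driver B"],
        "Trips Completed": [12, 7],
        "Accept Rate": [95, 80],  # assumed numeric, not a "95%" string
        "Supply Hours": [9, 5],
        "Rating": [4.8, 4.6],
    }
)

column_names = sample.columns.tolist()    # exact spellings, including spaces
missing_per_column = sample.isna().sum()  # count of missing values per column

print(column_names)
print(missing_per_column)
```

Column names that contain spaces must be quoted in SQL, which is why the queries in this article wrap them in backticks.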
Now let's answer the questions using both SQL and Python.
# Question 1: Who Qualifies for Bonus Option 1?
In the first question, we are asked to determine the total bonus payout for Option 1, which is:
$50 for each driver who is online at least 8 hours, accepts 90% of requests, completes 10 trips, and has a rating of 4.7 or better.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
In this step, we will start using pandasql.
In the following code, we select all drivers who meet the conditions for the Option 1 bonus, using the WHERE clause and the AND operator to link multiple conditions. To learn how to use WHERE and AND, refer to this documentation.
opt1_eligible = run("""
SELECT Name -- keep only a name column for clarity
FROM df
WHERE `Supply Hours` >= 8
AND `Trips Completed` >= 10
AND `Accept Rate` >= 90
AND Rating >= 4.7;
""")
opt1_eligible
Here's the output.

// Step 2: Finishing in Pandas
After filtering the dataset using SQL with pandasql, we switch to pandas to perform the numerical calculations and finish the analysis. This hybrid approach, which combines SQL and Python, improves both readability and flexibility.
Next, using the following Python code, we calculate the total payout by multiplying the number of qualified drivers (obtained with len()) by the $50 bonus per driver. Check the documentation to see how the len() function works.
payout_opt1 = 50 * len(opt1_eligible)
print(f"Option 1 payout: ${payout_opt1:,}")
Here's the output.
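As a cross-check, the same filter can be written as a pandas boolean mask. This is a sketch on invented rows rather than the project data, so the payout it prints says nothing about the real answer.

```python
import pandas as pd

# Invented rows for illustration; the real project reads dataset_2.csv.
sample = pd.DataFrame(
    {
        "Name": ["Driver A", "Driver B", "Driver C"],
        "Trips Completed": [12, 10, 9],
        "Accept Rate": [95, 90, 99],
        "Supply Hours": [9, 8, 10],
        "Rating": [4.8, 4.7, 4.9],
    }
)

# Same four conditions as the SQL WHERE clause, combined with & (element-wise AND).
mask = (
    (sample["Supply Hours"] >= 8)
    & (sample["Trips Completed"] >= 10)
    & (sample["Accept Rate"] >= 90)
    & (sample["Rating"] >= 4.7)
)
eligible = sample.loc[mask, "Name"]
payout = 50 * len(eligible)  # $50 per qualifying driver
print(f"Option 1 payout: ${payout:,}")
```

If the SQL result and the mask disagree on the real data, a column type (for example, a rate stored as text) is a likely place to look.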

# Question 2: Calculating the Total Payout for Bonus Option 2
In the second question, we are asked to find the total bonus payout using Option 2:
$4/trip for all drivers who complete 12 trips and have a rating of 4.7 or better.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
First, we use SQL to filter for drivers who meet the Option 2 criteria: completing at least 12 trips and maintaining a rating of 4.7 or higher.
# Grab only the rows that satisfy the Option-2 thresholds
opt2_drivers = run("""
SELECT Name,
`Trips Completed`
FROM df
WHERE `Trips Completed` >= 12
AND Rating >= 4.7;
""")
opt2_drivers.head()
Here's what we get.

// Step 2: Finishing the Calculation in Pandas
Now let's perform the calculation using pandas. The code computes the total bonus by summing the Trips Completed column with sum() and then multiplying the result by the $4 bonus per trip.
total_trips = opt2_drivers["Trips Completed"].sum()
option2_bonus = 4 * total_trips
print(f"Total trips: {total_trips}, Option-2 payout: ${option2_bonus}")
Here is the result.
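The aggregation could also be pushed into SQL with SUM(). The sketch below uses Python's built-in sqlite3 module on invented rows, rather than pandasql and the project data, purely to show what an equivalent single query might look like.

```python
import sqlite3

# Invented rows: (name, trips completed, rating).
rows = [("Driver A", 12, 4.8), ("Driver B", 15, 4.6), ("Driver C", 20, 4.7)]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE df ("Name" TEXT, "Trips Completed" INTEGER, "Rating" REAL)')
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)

# SUM() over the qualifying rows, multiplied by the $4 per-trip bonus.
# COALESCE turns the NULL that SUM() returns for zero rows into 0.
query = """
SELECT COALESCE(SUM(`Trips Completed`), 0) * 4 AS payout
FROM df
WHERE `Trips Completed` >= 12
AND Rating >= 4.7;
"""
option2_payout_sql = conn.execute(query).fetchone()[0]
conn.close()
print(option2_payout_sql)
```

Keeping the filter in SQL and the arithmetic in pandas, as the article does, leaves the list of qualifying drivers available for inspection; the single-query form trades that for brevity.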

# Question 3: Identifying Drivers Who Qualify for Option 1 But Not Option 2
In the third question, we are asked to count the number of drivers who qualify for Option 1 but not for Option 2.
// Step 1: Building Two Eligibility Tables with SQL (pandasql)
In the following SQL code, we create two datasets: one for drivers who meet the Option 1 criteria and another for those who meet the Option 2 criteria.
# All Option-1 drivers
opt1_drivers = run("""
SELECT Name
FROM df
WHERE `Supply Hours` >= 8
AND `Trips Completed` >= 10
AND `Accept Rate` >= 90
AND Rating >= 4.7;
""")
# All Option-2 drivers
opt2_drivers = run("""
SELECT Name
FROM df
WHERE `Trips Completed` >= 12
AND Rating >= 4.7;
""")
// Step 2: Using Python Set Logic to Spot the Difference
Next, we will use Python to identify the drivers who appear in Option 1 but not in Option 2, using set operations.
Here's the code:
only_opt1 = set(opt1_drivers["Name"]) - set(opt2_drivers["Name"])
count_only_opt1 = len(only_opt1)
print(f"Drivers qualifying for Option 1 but not Option 2: {count_only_opt1}")
Here's the output.

By combining these methods, we use SQL for filtering and Python's set logic for comparing the resulting datasets.
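For readers less familiar with set difference, here is a small self-contained sketch on invented names showing what the `-` operator returns.

```python
# Invented names for illustration.
opt1_names = ["Driver A", "Driver B", "Driver C"]
opt2_names = ["Driver B", "Driver D"]

# Elements of the first set that are absent from the second.
only_opt1 = set(opt1_names) - set(opt2_names)

# Method form of the same operation; it accepts any iterable as its argument.
same_result = set(opt1_names).difference(opt2_names)

print(sorted(only_opt1))
```

Sets discard duplicates, so this count is only reliable when driver names are unique; with repeated names, a driver identifier would be a safer key.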
# Question 4: Finding Low-Performance Drivers with High Ratings
In question 4, we are asked to determine the percentage of drivers who completed fewer than 10 trips, had an acceptance rate below 90%, and still maintained a rating of 4.7 or higher.
// Step 1: Pulling the Subset with SQL (pandasql)
In the following code, we select all drivers who have completed fewer than 10 trips, have an acceptance rate below 90%, and hold a rating of at least 4.7.
low_kpi_df = run("""
SELECT *
FROM df
WHERE `Trips Completed` < 10
AND `Accept Rate` < 90
AND Rating >= 4.7;
""")
low_kpi_df
Here's the output.

// Step 2: Calculating the Percentage in Plain Pandas
In this step, we will use Python to calculate the percentage of such drivers.
We simply divide the number of filtered drivers by the total driver count, then multiply by 100 to get the percentage.
Here's the code:
num_low_kpi = len(low_kpi_df)
total_drivers = len(df)
percentage = round(100 * num_low_kpi / total_drivers, 2)
print(f"{num_low_kpi} out of {total_drivers} drivers ⇒ {percentage}%")
Here's the output.
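If this percentage is needed in more than one place, the calculation can be wrapped in a small function. The helper below is hypothetical (the name `share_percent` is not from the article) and adds a guard for an empty dataset, which the inline version does not handle.

```python
def share_percent(part, total, digits=2):
    """Return part/total as a percentage rounded to `digits`; 0.0 when total is 0."""
    if total == 0:
        # Avoid ZeroDivisionError on an empty dataset.
        return 0.0
    return round(100 * part / total, digits)


# Example with invented counts: 3 matching drivers out of 40.
print(share_percent(3, 40))
```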

# Question 5: Calculating Annual Profit Without Partnering with Uber
In the fifth question, we need to calculate the annual income of a taxi driver who does not partner with Uber, based on the given cost and revenue parameters.
// Step 1: Pulling Yearly Revenue and Expenses with SQL (pandasql)
Using SQL, we first calculate yearly revenue from the daily fares and total the yearly expenses for gas, rent, and insurance.
taxi_stats = run("""
SELECT
200*6*(52-3) AS annual_revenue,
((200+500)*(52-3) + 400*12) AS annual_expenses
""")
taxi_stats
Here's the output.

// Step 2: Deriving Profit and Margin with Pandas
In the next step, we will use Python to compute the profit and the margin the driver earns when not partnering with Uber.
rev = taxi_stats.loc[0, "annual_revenue"]
cost = taxi_stats.loc[0, "annual_expenses"]
profit = rev - cost
margin = round(100 * profit / rev, 2)
print(f"Revenue : ${rev:,}")
print(f"Expenses : ${cost:,}")
print(f"Profit : ${profit:,} (margin: {margin}%)")
Here's what we get.
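The constants in the query are easier to audit when each one has a name. The sketch below repeats the same arithmetic in plain Python; the labels are an assumed reading of the query (daily fares, weekly gas and rent, monthly insurance, three weeks off per year), not names taken from the project brief.

```python
# Assumed meaning of each constant in the SQL above.
DAILY_FARES = 200          # dollars per working day
DAYS_PER_WEEK = 6
WORKING_WEEKS = 52 - 3     # 49 working weeks
GAS_PER_WEEK = 200
RENT_PER_WEEK = 500
INSURANCE_PER_MONTH = 400

annual_revenue = DAILY_FARES * DAYS_PER_WEEK * WORKING_WEEKS
annual_expenses = (GAS_PER_WEEK + RENT_PER_WEEK) * WORKING_WEEKS + INSURANCE_PER_MONTH * 12
annual_profit = annual_revenue - annual_expenses
margin_pct = round(100 * annual_profit / annual_revenue, 2)

print(annual_revenue, annual_expenses, annual_profit, margin_pct)
```

These values agree with the revenue (58800) and profit (19700) figures reused in Question 6.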

# Question 6: Calculating the Fare Increase Needed to Maintain Profitability
In the sixth question, we assume that the same driver decides to buy a Town Car and partner with Uber.
Gas expenses increase by 5%, insurance decreases by 20%, and rental costs are eliminated, but the driver needs to cover the $40,000 cost of the car. We are asked to calculate how much this driver's weekly gross fares must increase in the first year to both pay off the car and maintain the same annual profit margin.
// Step 1: Building the New One-Year Expense Stack with SQL
In this step, we will use SQL to calculate the new one-year expenses with the adjusted gas and insurance costs, no rental fees, and the cost of the car.
new_exp = run("""
SELECT
40000 AS car,
200*1.05*(52-3) AS gas, -- +5 %
400*0.80*12 AS insurance -- –20 %
""")
new_cost = new_exp.sum(axis=1).iloc[0]
new_cost
Here's the output.

// Step 2: Calculating the Weekly Fare Increase with Pandas
Next, we use Python to calculate how much more the driver must earn per week to preserve that margin after buying the car.
# Existing values from Question 5
old_rev = 58800
old_profit = 19700
old_margin = old_profit / old_rev
weeks = 49
# new_cost was calculated in the previous step (54130.0)
# We need to find the new revenue (new_rev) such that the profit margin remains the same:
# (new_rev - new_cost) / new_rev = old_margin
# Solving for new_rev gives: new_rev = new_cost / (1 - old_margin)
new_rev_required = new_cost / (1 - old_margin)
# The total increase in annual revenue needed is the difference
total_increase = new_rev_required - old_rev
# Divide by the number of working weeks to get the required weekly increase
weekly_bump = round(total_increase / weeks, 2)
print(f"Required weekly gross-fare increase = ${weekly_bump}")
Here's what we get.
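One way to gain confidence in the rearranged formula is to plug the resulting revenue back into the margin definition and confirm the margin is unchanged. The sketch below does that with the figures from the code above; the function name `required_revenue` is hypothetical.

```python
import math


def required_revenue(new_cost, target_margin):
    """Revenue R such that (R - new_cost) / R equals target_margin."""
    if not 0 <= target_margin < 1:
        # A margin of 100% or more cannot be reached with positive costs.
        raise ValueError("target_margin must be in [0, 1)")
    return new_cost / (1 - target_margin)


old_rev, old_profit, new_cost, weeks = 58800, 19700, 54130.0, 49
old_margin = old_profit / old_rev
new_rev = required_revenue(new_cost, old_margin)

# Sanity check: the margin at the new revenue should match the old margin.
new_margin = (new_rev - new_cost) / new_rev
weekly_increase = (new_rev - old_rev) / weeks
print(math.isclose(new_margin, old_margin))
```

Note that this treats the full car cost as a first-year expense, as the article's calculation does.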

# Conclusion
By combining the strengths of SQL and Python, primarily through pandasql, we solved six different problems.
SQL helps with quickly filtering and summarizing structured data, while Python excels at advanced calculations and flexible data manipulation.
Throughout this analysis, we used both tools to simplify the workflow and make each step easier to interpret.
Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes about the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.



