A Beginner-Friendly Guide to Data Analysis With Polars


Image by Author | Ideogram
Introduction
When you want to analyze data with Python, pandas is often the first tool that comes to mind, and it is what many analysts reach for. But Polars has been gaining popularity for being fast and efficient. Written in Rust, Polars powers through data processing tasks that would slow other tools down. It is built for speed, memory efficiency, and ease of use. In this article, we will analyze sales data from a fictional coffee shop to learn Polars step by step. Sounds interesting? Let's get started!
🔗 Link to the code on GitHub
Installing Polars
Before we dive into the data analysis, let's get set up. First, install Polars and NumPy:
! pip install polars numpy
Now, let's import the libraries and modules we need:
import polars as pl
import numpy as np
from datetime import datetime, timedelta
We use pl as an alias for Polars.
Creating Sample Data
Imagine you run a small coffee shop, say "Bean There," and you have a pile of receipts and related information. You want to understand which drinks sell best, which days bring in the most revenue, and similar questions. So, let's start coding!
To make this guide hands-on, let's create a realistic dataset for the "Bean There" coffee shop. We will generate the kind of data any small business owner might actually have:
# Set up for consistent results
np.random.seed(42)

# Create realistic coffee shop data
def generate_coffee_data():
    n_records = 2000

    # Coffee menu items with realistic prices
    menu_items = ['Espresso', 'Cappuccino', 'Latte', 'Americano', 'Mocha', 'Cold Brew']
    prices = [2.50, 4.00, 4.50, 3.00, 5.00, 3.50]
    price_map = dict(zip(menu_items, prices))

    # Generate dates over 6 months
    start_date = datetime(2023, 6, 1)
    dates = [start_date + timedelta(days=np.random.randint(0, 180))
             for _ in range(n_records)]

    # Randomly select drinks, then map the correct price for each selected drink
    drinks = np.random.choice(menu_items, n_records)
    prices_chosen = [price_map[d] for d in drinks]

    data = {
        'date': dates,
        'drink': drinks,
        'price': prices_chosen,
        'quantity': np.random.choice([1, 1, 1, 2, 2, 3], n_records),
        'customer_type': np.random.choice(['Regular', 'New', 'Tourist'],
                                          n_records, p=[0.5, 0.3, 0.2]),
        'payment_method': np.random.choice(['Card', 'Cash', 'Mobile'],
                                           n_records, p=[0.6, 0.2, 0.2]),
        'rating': np.random.choice([2, 3, 4, 5], n_records, p=[0.1, 0.4, 0.4, 0.1])
    }
    return data

# Create our coffee shop DataFrame
coffee_data = generate_coffee_data()
df = pl.DataFrame(coffee_data)
This creates a sample dataset with 2,000 coffee transactions. Each row represents a single sale, with information about what was ordered, when, how many, the cost, and who bought it.
Inspecting Your Data
Before analyzing any data, you need to understand what you are working with. Think of it as reading through a new recipe before you start cooking:
# Take a peek at your data
print("First 5 transactions:")
print(df.head())
print("\nWhat types of data do we have?")
print(df.schema)
print("\nHow big is our dataset?")
print(f"We have {df.height} transactions and {df.width} columns")
The head() method shows you the first few rows. The schema tells you what kind of information each column contains (numbers, text, dates, etc.).
First 5 transactions:
shape: (5, 7)
┌─────────────────────┬────────────┬───────┬──────────┬───────────────┬────────────────┬────────┐
│ date                ┆ drink      ┆ price ┆ quantity ┆ customer_type ┆ payment_method ┆ rating │
│ ---                 ┆ ---        ┆ ---   ┆ ---      ┆ ---           ┆ ---            ┆ ---    │
│ datetime[μs]        ┆ str        ┆ f64   ┆ i64      ┆ str           ┆ str            ┆ i64    │
╞═════════════════════╪════════════╪═══════╪══════════╪═══════════════╪════════════════╪════════╡
│ 2023-09-11 00:00:00 ┆ Cold Brew  ┆ 5.0   ┆ 1        ┆ New           ┆ Cash           ┆ 4      │
│ 2023-11-27 00:00:00 ┆ Cappuccino ┆ 4.5   ┆ 1        ┆ New           ┆ Card           ┆ 4      │
│ 2023-09-01 00:00:00 ┆ Espresso   ┆ 4.5   ┆ 1        ┆ Regular       ┆ Card           ┆ 3      │
│ 2023-06-15 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 1        ┆ New           ┆ Card           ┆ 4      │
│ 2023-09-15 00:00:00 ┆ Mocha      ┆ 5.0   ┆ 2        ┆ Regular       ┆ Card           ┆ 3      │
└─────────────────────┴────────────┴───────┴──────────┴───────────────┴────────────────┴────────┘
What types of data do we have?
Schema({'date': Datetime(time_unit="us", time_zone=None), 'drink': String, 'price': Float64, 'quantity': Int64, 'customer_type': String, 'payment_method': String, 'rating': Int64})
How big is our dataset?
We have 2000 transactions and 7 columns
Adding New Columns
Now let's start extracting business insights. Every coffee shop owner wants to know the total revenue from each transaction:
# Calculate total sales amount and add useful date information
df_enhanced = df.with_columns([
    # Calculate revenue per transaction
    (pl.col('price') * pl.col('quantity')).alias('total_sale'),

    # Extract useful date components
    pl.col('date').dt.weekday().alias('day_of_week'),
    pl.col('date').dt.month().alias('month'),
    pl.col('date').dt.hour().alias('hour_of_day')
])
print("Sample of enhanced data:")
print(df_enhanced.head())
Output (your exact numbers may vary):
Sample of enhanced data:
shape: (5, 11)
┌─────────────────────┬────────────┬───────┬──────────┬───┬────────────┬─────────────┬───────┬─────────────┐
│ date                ┆ drink      ┆ price ┆ quantity ┆ … ┆ total_sale ┆ day_of_week ┆ month ┆ hour_of_day │
│ ---                 ┆ ---        ┆ ---   ┆ ---      ┆   ┆ ---        ┆ ---         ┆ ---   ┆ ---         │
│ datetime[μs]        ┆ str        ┆ f64   ┆ i64      ┆   ┆ f64        ┆ i8          ┆ i8    ┆ i8          │
╞═════════════════════╪════════════╪═══════╪══════════╪═══╪════════════╪═════════════╪═══════╪═════════════╡
│ 2023-09-11 00:00:00 ┆ Cold Brew  ┆ 5.0   ┆ 1        ┆ … ┆ 5.0        ┆ 1           ┆ 9     ┆ 0           │
│ 2023-11-27 00:00:00 ┆ Cappuccino ┆ 4.5   ┆ 1        ┆ … ┆ 4.5        ┆ 1           ┆ 11    ┆ 0           │
│ 2023-09-01 00:00:00 ┆ Espresso   ┆ 4.5   ┆ 1        ┆ … ┆ 4.5        ┆ 5           ┆ 9     ┆ 0           │
│ 2023-06-15 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 1        ┆ … ┆ 5.0        ┆ 4           ┆ 6     ┆ 0           │
│ 2023-09-15 00:00:00 ┆ Mocha      ┆ 5.0   ┆ 2        ┆ … ┆ 10.0       ┆ 5           ┆ 9     ┆ 0           │
└─────────────────────┴────────────┴───────┴──────────┴───┴────────────┴─────────────┴───────┴─────────────┘
Here's what is happening:
- with_columns() adds new columns to our data
- pl.col() refers to an existing column
- alias() gives the new columns descriptive names
- The dt accessor extracts parts of dates (such as the month from a full date)
Think of this as adding calculated columns to a spreadsheet. We don't change the original data; we just add more information to work with.
Grouping Data
Now let's answer some interesting questions.
// Question 1: What are our best-selling drinks?
This groups all transactions by drink type, then calculates totals and averages per group. It's like sorting all your receipts into piles by drink and adding up each pile.
drink_performance = (df_enhanced
    .group_by('drink')
    .agg([
        pl.col('total_sale').sum().alias('total_revenue'),
        pl.col('quantity').sum().alias('total_sold'),
        pl.col('rating').mean().alias('avg_rating')
    ])
    .sort('total_revenue', descending=True)
)
print("Drink performance ranking:")
print(drink_performance)
Output:
Drink performance ranking:
shape: (6, 4)
┌────────────┬───────────────┬────────────┬────────────┐
│ drink      ┆ total_revenue ┆ total_sold ┆ avg_rating │
│ ---        ┆ ---           ┆ ---        ┆ ---        │
│ str        ┆ f64           ┆ i64        ┆ f64        │
╞════════════╪═══════════════╪════════════╪════════════╡
│ Americano  ┆ 2242.0        ┆ 595        ┆ 3.476454   │
│ Mocha      ┆ 2204.0        ┆ 591        ┆ 3.492711   │
│ Espresso   ┆ 2119.5        ┆ 570        ┆ 3.514793   │
│ Cold Brew  ┆ 2035.5        ┆ 556        ┆ 3.475758   │
│ Cappuccino ┆ 1962.5        ┆ 521        ┆ 3.541139   │
│ Latte      ┆ 1949.5        ┆ 514        ┆ 3.528846   │
└────────────┴───────────────┴────────────┴────────────┘
// Question 2: What do our daily patterns look like?
Let's now find the number of transactions and the corresponding revenue for each day of the week.
daily_patterns = (df_enhanced
    .group_by('day_of_week')
    .agg([
        pl.col('total_sale').sum().alias('daily_revenue'),
        pl.len().alias('number_of_transactions')
    ])
    .sort('day_of_week')
)
print("Daily business patterns:")
print(daily_patterns)
Output:
Daily business patterns:
shape: (7, 3)
┌─────────────┬───────────────┬────────────────────────┐
│ day_of_week ┆ daily_revenue ┆ number_of_transactions │
│ ---         ┆ ---           ┆ ---                    │
│ i8          ┆ f64           ┆ u32                    │
╞═════════════╪═══════════════╪════════════════════════╡
│ 1           ┆ 2061.0        ┆ 324                    │
│ 2           ┆ 1761.0        ┆ 276                    │
│ 3           ┆ 1710.0        ┆ 278                    │
│ 4           ┆ 1784.0        ┆ 288                    │
│ 5           ┆ 1651.5        ┆ 265                    │
│ 6           ┆ 1596.0        ┆ 259                    │
│ 7           ┆ 1949.5        ┆ 310                    │
└─────────────┴───────────────┴────────────────────────┘
Filtering and Sorting Data
Let's find our biggest transactions:
# Find transactions over $10 (multiple items or expensive drinks)
big_orders = (df_enhanced
    .filter(pl.col('total_sale') > 10.0)
    .sort('total_sale', descending=True)
)
print(f"We have {big_orders.height} orders over $10")
print("Top 5 biggest orders:")
print(big_orders.head())
Output:
We have 204 orders over $10
Top 5 biggest orders:
shape: (5, 11)
┌─────────────────────┬────────────┬───────┬──────────┬───┬────────────┬─────────────┬───────┬─────────────┐
│ date                ┆ drink      ┆ price ┆ quantity ┆ … ┆ total_sale ┆ day_of_week ┆ month ┆ hour_of_day │
│ ---                 ┆ ---        ┆ ---   ┆ ---      ┆   ┆ ---        ┆ ---         ┆ ---   ┆ ---         │
│ datetime[μs]        ┆ str        ┆ f64   ┆ i64      ┆   ┆ f64        ┆ i8          ┆ i8    ┆ i8          │
╞═════════════════════╪════════════╪═══════╪══════════╪═══╪════════════╪═════════════╪═══════╪═════════════╡
│ 2023-07-21 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 5           ┆ 7     ┆ 0           │
│ 2023-08-02 00:00:00 ┆ Latte      ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 3           ┆ 8     ┆ 0           │
│ 2023-07-21 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 5           ┆ 7     ┆ 0           │
│ 2023-10-08 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 7           ┆ 10    ┆ 0           │
│ 2023-09-07 00:00:00 ┆ Latte      ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 4           ┆ 9     ┆ 0           │
└─────────────────────┴────────────┴───────┴──────────┴───┴────────────┴─────────────┴───────┴─────────────┘
Analyzing Customer Behavior
Let's look at customer patterns:
# Analyze customer behavior by type
customer_analysis = (df_enhanced
    .group_by('customer_type')
    .agg([
        pl.col('total_sale').mean().alias('avg_spending'),
        pl.col('total_sale').sum().alias('total_revenue'),
        pl.len().alias('visit_count'),
        pl.col('rating').mean().alias('avg_satisfaction')
    ])
    .with_columns([
        # Calculate revenue per visit
        (pl.col('total_revenue') / pl.col('visit_count')).alias('revenue_per_visit')
    ])
)
print("Customer behavior analysis:")
print(customer_analysis)
Output:
Customer behavior analysis:
shape: (3, 6)
┌───────────────┬──────────────┬───────────────┬─────────────┬──────────────────┬───────────────────┐
│ customer_type ┆ avg_spending ┆ total_revenue ┆ visit_count ┆ avg_satisfaction ┆ revenue_per_visit │
│ ---           ┆ ---          ┆ ---           ┆ ---         ┆ ---              ┆ ---               │
│ str           ┆ f64          ┆ f64           ┆ u32         ┆ f64              ┆ f64               │
╞═══════════════╪══════════════╪═══════════════╪═════════════╪══════════════════╪═══════════════════╡
│ Regular       ┆ 6.277832     ┆ 6428.5        ┆ 1024        ┆ 3.499023         ┆ 6.277832          │
│ Tourist       ┆ 6.185185     ┆ 2505.0        ┆ 405         ┆ 3.518519         ┆ 6.185185          │
│ New           ┆ 6.268827     ┆ 3579.5        ┆ 571         ┆ 3.502627         ┆ 6.268827          │
└───────────────┴──────────────┴───────────────┴─────────────┴──────────────────┴───────────────────┘
Putting It All Together
Let's build a complete business summary:
# Create a complete business summary
business_summary = {
    'total_revenue': df_enhanced['total_sale'].sum(),
    'total_transactions': df_enhanced.height,
    'average_transaction': df_enhanced['total_sale'].mean(),
    'best_selling_drink': drink_performance.row(0)[0],  # First row, first column
    'customer_satisfaction': df_enhanced['rating'].mean()
}
print("\n=== BEAN THERE COFFEE SHOP - SUMMARY ===")
for key, value in business_summary.items():
    if isinstance(value, float) and key != 'customer_satisfaction':
        print(f"{key.replace('_', ' ').title()}: ${value:.2f}")
    else:
        print(f"{key.replace('_', ' ').title()}: {value}")
Output:
=== BEAN THERE COFFEE SHOP - SUMMARY ===
Total Revenue: $12513.00
Total Transactions: 2000
Average Transaction: $6.26
Best Selling Drink: Americano
Customer Satisfaction: 3.504
Wrapping Up
You just completed a broad introduction to data analysis with Polars! Using our coffee shop example, you (I hope) learned how to turn raw purchase records into meaningful business insights.
Remember, building data analysis skills is like learning to cook: you start with basic techniques (like the examples in this guide) and gradually get better. The key is to keep practicing and stay curious.
The next time you face a new dataset, ask yourself:
- What story does this data tell?
- What patterns might be hiding here?
- What questions can this data answer?
Then use your new Polars skills to find out. Happy analyzing!
Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include software development, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. She also creates engaging resource overviews and coding tutorials.



