A Beginner-Friendly Guide to Data Analysis With Polars


Image by Author | Ideogram
Introduction
When you want to analyze data with Python, pandas is often the first tool that comes to mind, and it is what many analysts reach for. But Polars has been gaining popularity for being fast and efficient. Written in Rust, Polars powers through data processing tasks that would slow other tools down. It is built for speed, memory efficiency, and ease of use. In this article, we will analyze sales data from a fictional coffee shop to learn Polars step by step. Sounds interesting? Let's get started!
🔗 Link to the code on GitHub
Installing Polars
Before we dive into the data analysis, let's get set up. First, install Polars and NumPy:
! pip install polars numpy
Now, let's import the libraries and modules we need:
import polars as pl
import numpy as np
from datetime import datetime, timedelta
We use pl as an alias for Polars.
Creating Sample Data
Imagine you run a small coffee shop, say "Bean There," and you have a pile of receipts and related information. You want to understand which drinks sell best, which days bring in the most revenue, and similar questions. So, let's start coding!
To make this guide hands-on, let's create a realistic dataset for the "Bean There" coffee shop. We will generate the kind of data any small business owner might actually have:
# Set up for consistent results
np.random.seed(42)

# Create realistic coffee shop data
def generate_coffee_data():
    n_records = 2000

    # Coffee menu items with realistic prices
    menu_items = ['Espresso', 'Cappuccino', 'Latte', 'Americano', 'Mocha', 'Cold Brew']
    prices = [2.50, 4.00, 4.50, 3.00, 5.00, 3.50]
    price_map = dict(zip(menu_items, prices))

    # Generate dates over 6 months
    start_date = datetime(2023, 6, 1)
    dates = [start_date + timedelta(days=np.random.randint(0, 180))
             for _ in range(n_records)]

    # Randomly select drinks, then map the correct price for each selected drink
    drinks = np.random.choice(menu_items, n_records)
    prices_chosen = [price_map[d] for d in drinks]

    data = {
        'date': dates,
        'drink': drinks,
        'price': prices_chosen,
        'quantity': np.random.choice([1, 1, 1, 2, 2, 3], n_records),
        'customer_type': np.random.choice(['Regular', 'New', 'Tourist'],
                                          n_records, p=[0.5, 0.3, 0.2]),
        'payment_method': np.random.choice(['Card', 'Cash', 'Mobile'],
                                           n_records, p=[0.6, 0.2, 0.2]),
        'rating': np.random.choice([2, 3, 4, 5], n_records, p=[0.1, 0.4, 0.4, 0.1])
    }
    return data

# Create our coffee shop DataFrame
coffee_data = generate_coffee_data()
df = pl.DataFrame(coffee_data)
This creates a sample dataset with 2,000 coffee transactions. Each row represents a single sale, with information about what was ordered, when, how many, the cost, and who bought it.
Inspecting Your Data
Before analyzing any data, you need to understand what you are working with. Think of it as reading through a new recipe before you start cooking:
# Take a peek at your data
print("First 5 transactions:")
print(df.head())
print("\nWhat types of data do we have?")
print(df.schema)
print("\nHow big is our dataset?")
print(f"We have {df.height} transactions and {df.width} columns")
The head() method shows you the first few rows. The schema tells you what kind of information each column contains (numbers, text, dates, etc.).
First 5 transactions:
shape: (5, 7)
┌─────────────────────┬────────────┬───────┬──────────┬───────────────┬────────────────┬────────┐
│ date                ┆ drink      ┆ price ┆ quantity ┆ customer_type ┆ payment_method ┆ rating │
│ ---                 ┆ ---        ┆ ---   ┆ ---      ┆ ---           ┆ ---            ┆ ---    │
│ datetime[μs]        ┆ str        ┆ f64   ┆ i64      ┆ str           ┆ str            ┆ i64    │
╞═════════════════════╪════════════╪═══════╪══════════╪═══════════════╪════════════════╪════════╡
│ 2023-09-11 00:00:00 ┆ Cold Brew  ┆ 5.0   ┆ 1        ┆ New           ┆ Cash           ┆ 4      │
│ 2023-11-27 00:00:00 ┆ Cappuccino ┆ 4.5   ┆ 1        ┆ New           ┆ Card           ┆ 4      │
│ 2023-09-01 00:00:00 ┆ Espresso   ┆ 4.5   ┆ 1        ┆ Regular       ┆ Card           ┆ 3      │
│ 2023-06-15 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 1        ┆ New           ┆ Card           ┆ 4      │
│ 2023-09-15 00:00:00 ┆ Mocha      ┆ 5.0   ┆ 2        ┆ Regular       ┆ Card           ┆ 3      │
└─────────────────────┴────────────┴───────┴──────────┴───────────────┴────────────────┴────────┘
What types of data do we have?
Schema({'date': Datetime(time_unit="us", time_zone=None), 'drink': String, 'price': Float64, 'quantity': Int64, 'customer_type': String, 'payment_method': String, 'rating': Int64})
How big is our dataset?
We have 2000 transactions and 7 columns
Adding New Columns
Now let's start extracting business insights. Every coffee shop owner wants to know the total revenue from each transaction:
# Calculate total sales amount and add useful date information
df_enhanced = df.with_columns([
    # Calculate revenue per transaction
    (pl.col('price') * pl.col('quantity')).alias('total_sale'),

    # Extract useful date components
    pl.col('date').dt.weekday().alias('day_of_week'),
    pl.col('date').dt.month().alias('month'),
    pl.col('date').dt.hour().alias('hour_of_day')
])
print("Sample of enhanced data:")
print(df_enhanced.head())
Output (your exact numbers may vary):
Sample of enhanced data:
shape: (5, 11)
┌─────────────────────┬────────────┬───────┬──────────┬───┬────────────┬─────────────┬───────┬─────────────┐
│ date                ┆ drink      ┆ price ┆ quantity ┆ … ┆ total_sale ┆ day_of_week ┆ month ┆ hour_of_day │
│ ---                 ┆ ---        ┆ ---   ┆ ---      ┆   ┆ ---        ┆ ---         ┆ ---   ┆ ---         │
│ datetime[μs]        ┆ str        ┆ f64   ┆ i64      ┆   ┆ f64        ┆ i8          ┆ i8    ┆ i8          │
╞═════════════════════╪════════════╪═══════╪══════════╪═══╪════════════╪═════════════╪═══════╪═════════════╡
│ 2023-09-11 00:00:00 ┆ Cold Brew  ┆ 5.0   ┆ 1        ┆ … ┆ 5.0        ┆ 1           ┆ 9     ┆ 0           │
│ 2023-11-27 00:00:00 ┆ Cappuccino ┆ 4.5   ┆ 1        ┆ … ┆ 4.5        ┆ 1           ┆ 11    ┆ 0           │
│ 2023-09-01 00:00:00 ┆ Espresso   ┆ 4.5   ┆ 1        ┆ … ┆ 4.5        ┆ 5           ┆ 9     ┆ 0           │
│ 2023-06-15 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 1        ┆ … ┆ 5.0        ┆ 4           ┆ 6     ┆ 0           │
│ 2023-09-15 00:00:00 ┆ Mocha      ┆ 5.0   ┆ 2        ┆ … ┆ 10.0       ┆ 5           ┆ 9     ┆ 0           │
└─────────────────────┴────────────┴───────┴──────────┴───┴────────────┴─────────────┴───────┴─────────────┘
Here's what is happening:
- with_columns() adds new columns to our data
- pl.col() refers to an existing column
- alias() gives the new columns descriptive names
- The dt accessor extracts parts of dates (such as the month from a full date)
Think of this as adding calculated columns to a spreadsheet. We don't change the original data; we just add more information to work with.
Grouping Data
Now let's answer some interesting questions.
// Question 1: What are our best-selling drinks?
This groups all transactions by drink type, then calculates totals and averages per group. It's like sorting all your receipts into piles by drink and adding up each pile.
drink_performance = (df_enhanced
    .group_by('drink')
    .agg([
        pl.col('total_sale').sum().alias('total_revenue'),
        pl.col('quantity').sum().alias('total_sold'),
        pl.col('rating').mean().alias('avg_rating')
    ])
    .sort('total_revenue', descending=True)
)
print("Drink performance ranking:")
print(drink_performance)
Output:
Drink performance ranking:
shape: (6, 4)
┌────────────┬───────────────┬────────────┬────────────┐
│ drink      ┆ total_revenue ┆ total_sold ┆ avg_rating │
│ ---        ┆ ---           ┆ ---        ┆ ---        │
│ str        ┆ f64           ┆ i64        ┆ f64        │
╞════════════╪═══════════════╪════════════╪════════════╡
│ Americano  ┆ 2242.0        ┆ 595        ┆ 3.476454   │
│ Mocha      ┆ 2204.0        ┆ 591        ┆ 3.492711   │
│ Espresso   ┆ 2119.5        ┆ 570        ┆ 3.514793   │
│ Cold Brew  ┆ 2035.5        ┆ 556        ┆ 3.475758   │
│ Cappuccino ┆ 1962.5        ┆ 521        ┆ 3.541139   │
│ Latte      ┆ 1949.5        ┆ 514        ┆ 3.528846   │
└────────────┴───────────────┴────────────┴────────────┘
// Question 2: What do our daily patterns look like?
Let's now find the number of transactions and the corresponding revenue for each day of the week.
daily_patterns = (df_enhanced
    .group_by('day_of_week')
    .agg([
        pl.col('total_sale').sum().alias('daily_revenue'),
        pl.len().alias('number_of_transactions')
    ])
    .sort('day_of_week')
)
print("Daily business patterns:")
print(daily_patterns)
Output:
Daily business patterns:
shape: (7, 3)
┌─────────────┬───────────────┬────────────────────────┐
│ day_of_week ┆ daily_revenue ┆ number_of_transactions │
│ ---         ┆ ---           ┆ ---                    │
│ i8          ┆ f64           ┆ u32                    │
╞═════════════╪═══════════════╪════════════════════════╡
│ 1           ┆ 2061.0        ┆ 324                    │
│ 2           ┆ 1761.0        ┆ 276                    │
│ 3           ┆ 1710.0        ┆ 278                    │
│ 4           ┆ 1784.0        ┆ 288                    │
│ 5           ┆ 1651.5        ┆ 265                    │
│ 6           ┆ 1596.0        ┆ 259                    │
│ 7           ┆ 1949.5        ┆ 310                    │
└─────────────┴───────────────┴────────────────────────┘
Filtering and Sorting Data
Let's find our biggest transactions:
# Find transactions over $10 (multiple items or expensive drinks)
big_orders = (df_enhanced
    .filter(pl.col('total_sale') > 10.0)
    .sort('total_sale', descending=True)
)
print(f"We have {big_orders.height} orders over $10")
print("Top 5 biggest orders:")
print(big_orders.head())
Output:
We have 204 orders over $10
Top 5 biggest orders:
shape: (5, 11)
┌─────────────────────┬────────────┬───────┬──────────┬───┬────────────┬─────────────┬───────┬─────────────┐
│ date                ┆ drink      ┆ price ┆ quantity ┆ … ┆ total_sale ┆ day_of_week ┆ month ┆ hour_of_day │
│ ---                 ┆ ---        ┆ ---   ┆ ---      ┆   ┆ ---        ┆ ---         ┆ ---   ┆ ---         │
│ datetime[μs]        ┆ str        ┆ f64   ┆ i64      ┆   ┆ f64        ┆ i8          ┆ i8    ┆ i8          │
╞═════════════════════╪════════════╪═══════╪══════════╪═══╪════════════╪═════════════╪═══════╪═════════════╡
│ 2023-07-21 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 5           ┆ 7     ┆ 0           │
│ 2023-08-02 00:00:00 ┆ Latte      ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 3           ┆ 8     ┆ 0           │
│ 2023-07-21 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 5           ┆ 7     ┆ 0           │
│ 2023-10-08 00:00:00 ┆ Cappuccino ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 7           ┆ 10    ┆ 0           │
│ 2023-09-07 00:00:00 ┆ Latte      ┆ 5.0   ┆ 3        ┆ … ┆ 15.0       ┆ 4           ┆ 9     ┆ 0           │
└─────────────────────┴────────────┴───────┴──────────┴───┴────────────┴─────────────┴───────┴─────────────┘
Analyzing Customer Behavior
Let's look at customer patterns:
# Analyze customer behavior by type
customer_analysis = (df_enhanced
    .group_by('customer_type')
    .agg([
        pl.col('total_sale').mean().alias('avg_spending'),
        pl.col('total_sale').sum().alias('total_revenue'),
        pl.len().alias('visit_count'),
        pl.col('rating').mean().alias('avg_satisfaction')
    ])
    .with_columns([
        # Calculate revenue per visit
        (pl.col('total_revenue') / pl.col('visit_count')).alias('revenue_per_visit')
    ])
)
print("Customer behavior analysis:")
print(customer_analysis)
Output:
Customer behavior analysis:
shape: (3, 6)
┌───────────────┬──────────────┬───────────────┬─────────────┬──────────────────┬───────────────────┐
│ customer_type ┆ avg_spending ┆ total_revenue ┆ visit_count ┆ avg_satisfaction ┆ revenue_per_visit │
│ ---           ┆ ---          ┆ ---           ┆ ---         ┆ ---              ┆ ---               │
│ str           ┆ f64          ┆ f64           ┆ u32         ┆ f64              ┆ f64               │
╞═══════════════╪══════════════╪═══════════════╪═════════════╪══════════════════╪═══════════════════╡
│ Regular       ┆ 6.277832     ┆ 6428.5        ┆ 1024        ┆ 3.499023         ┆ 6.277832          │
│ Tourist       ┆ 6.185185     ┆ 2505.0        ┆ 405         ┆ 3.518519         ┆ 6.185185          │
│ New           ┆ 6.268827     ┆ 3579.5        ┆ 571         ┆ 3.502627         ┆ 6.268827          │
└───────────────┴──────────────┴───────────────┴─────────────┴──────────────────┴───────────────────┘
Putting It All Together
Let's build a complete business summary:
# Create a complete business summary
business_summary = {
    'total_revenue': df_enhanced['total_sale'].sum(),
    'total_transactions': df_enhanced.height,
    'average_transaction': df_enhanced['total_sale'].mean(),
    'best_selling_drink': drink_performance.row(0)[0],  # First row, first column
    'customer_satisfaction': df_enhanced['rating'].mean()
}
print("\n=== BEAN THERE COFFEE SHOP - SUMMARY ===")
for key, value in business_summary.items():
    if isinstance(value, float) and key != 'customer_satisfaction':
        print(f"{key.replace('_', ' ').title()}: ${value:.2f}")
    else:
        print(f"{key.replace('_', ' ').title()}: {value}")
Output:
=== BEAN THERE COFFEE SHOP - SUMMARY ===
Total Revenue: $12513.00
Total Transactions: 2000
Average Transaction: $6.26
Best Selling Drink: Americano
Customer Satisfaction: 3.504
Wrapping Up
You just completed a broad introduction to data analysis with Polars! Using our coffee shop example, you (I hope) learned how to turn raw purchase records into meaningful business insights.
Remember, building data analysis skills is like learning to cook: you start with basic techniques (like the examples in this guide) and gradually get better. The key is to keep practicing and stay curious.
The next time you face a new dataset, ask yourself:
- What story does this data tell?
- What patterns might be hiding here?
- What questions can this data answer?
Then use your new Polars skills to find out. Happy analyzing!
Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include software development, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. She also creates engaging resource overviews and coding tutorials.



