
Modern DataFrames in Python: A hands-on tutorial with Polars and DuckDB

If you work with data in Python, you've probably experienced the frustration of waiting minutes for a pandas job to finish.

At first, everything seems to be going well, but as your data grows and your pipeline becomes more complex, your laptop suddenly sounds like it's preparing for liftoff.

A few months ago, I worked on an e-commerce transaction analysis project with over three million rows of data.

It was a great experience, but I spent much of it watching simple groupby operations that used to run in seconds suddenly take minutes.

That's when I realized pandas is amazing, but not always enough.

This tutorial examines two modern alternatives to pandas, Polars and DuckDB, and shows how they can simplify handling large datasets.

Let me be upfront about a few things before we begin.

This article is not a deep dive into Rust memory management or a declaration that pandas is obsolete.

Instead, it is a practical, hands-on guide. You'll see real-life examples, personal experiences, and actionable insights that can save you time and frustration.


Why can pandas feel slow?

Back on that e-commerce project, I remember working with CSV files of more than two gigabytes, where a single filter or join in pandas often took several minutes to complete.

At that time, I would look at the screen, wishing I could grab a coffee or put on a few episodes of the show while the code ran.

The biggest pain points I encountered were speed, memory, and workflow complexity.

Large CSV files consume a lot of RAM, sometimes more than my machine could comfortably handle. On top of that, long chains of transformations make the code difficult to maintain and slow to execute.

Polars and DuckDB approach these challenges in different ways.

Polars, built in Rust, uses multi-threaded execution to process large datasets efficiently.

DuckDB, on the other hand, is designed for analytics and executes SQL queries without requiring you to load everything into memory.

Basically, each of them has its own superpower: Polars is a speed demon, and DuckDB is a kind of memory wizard.

And the best part? Both integrate seamlessly with Python, allowing you to improve your workflow without a complete rewrite.

Setting up your environment

Before we start coding, make sure your environment is ready. For consistency, I used pandas 2.2.0, Polars 0.20.0, and DuckDB 0.9.0.

Pinning versions can save you a headache when following a tutorial or sharing code.

pip install pandas==2.2.0 polars==0.20.0 duckdb==0.9.0

In Python, import the libraries:

import pandas as pd
import polars as pl
import duckdb
import warnings
warnings.filterwarnings("ignore")

For the examples, I will use an e-commerce sales dataset with columns such as order ID, product ID, country, revenue, and date. You can download similar data from Kaggle or generate synthetic data.
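If you'd rather not download anything, here's a minimal sketch that writes a synthetic sales.csv using only the standard library. The exact column names and value ranges are my own assumptions based on the description above; adjust them to match your real dataset.

```python
import csv
import datetime
import random

# Hypothetical schema: order_id, product_id, country, revenue, date
random.seed(42)
countries = ["Germany", "France", "USA", "Japan", "Brazil"]

with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "product_id", "country", "revenue", "date"])
    for i in range(1, 1001):
        writer.writerow([
            i,
            f"P{random.randint(1, 50):03d}",
            random.choice(countries),
            round(random.uniform(5.0, 500.0), 2),
            (datetime.date(2024, 1, 1)
             + datetime.timedelta(days=random.randint(0, 364))).isoformat(),
        ])
```

Bump the range up to a few million rows if you want to reproduce the performance differences discussed later.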

Loading data

Properly loading data sets the tone for your workflow. I remember a project where the CSV file had almost 5 million rows.

Pandas handled it, but load times were long, and repeated loads during testing were painful.

It was one of those times when you wish your laptop had a "fast-forward" button.

Switching to Polars and DuckDB changed everything: suddenly, I could load and manipulate data almost instantly, making testing and iteration much more enjoyable.

In pandas:

df_pd = pd.read_csv("sales.csv")
print(df_pd.head(3))

In Polars:

df_pl = pl.read_csv("sales.csv")
print(df_pl.head(3))

In DuckDB:

con = duckdb.connect()
df_duck = con.execute("SELECT * FROM 'sales.csv'").df()
print(df_duck.head(3))

DuckDB can query CSVs directly without loading all the data into memory, making it much easier to work with large files.

Data filtering

The problem here is that filtering in pandas can be slow when dealing with millions of rows. I once needed to analyze European transactions in a large sales dataset. Pandas took minutes, stalling my analysis.

In pandas:

filtered_pd = df_pd[df_pd.region == "Europe"]

Polars is fast and handles chained filters well:

filtered_pl = df_pl.filter(pl.col("region") == "Europe")

DuckDB uses SQL syntax:

filtered_duck = con.execute("""
    SELECT *
    FROM 'sales.csv'
    WHERE region = 'Europe'
""").df()

Now you can filter large datasets in seconds instead of minutes, leaving you more time to focus on the insights that really matter.

Aggregating large data quickly

Aggregations are where pandas really starts to feel sluggish. Consider calculating the total revenue per country for a marketing report.

In pandas:

agg_pd = df_pd.groupby("country")["revenue"].sum().reset_index()

In Polars:

agg_pl = df_pl.group_by("country").agg(pl.col("revenue").sum())

In DuckDB:

agg_duck = con.execute("""
    SELECT country, SUM(revenue) AS total_revenue
    FROM 'sales.csv'
    GROUP BY country
""").df()

I remember running this kind of aggregation on a dataset of 10 million rows. In pandas, it took about half an hour. Polars completed the same operation in less than a minute.

The feeling of relief was almost like finishing a race and finding your legs still work.

Joining data at scale

Data joining is one of those things that sounds simple until you actually dig deeper.

In real projects, your data often resides in multiple sources, so you must aggregate it using shared columns such as customer IDs.

I learned this the hard way while working on a project that needed to aggregate millions of customer orders with large demographic data.

Each file was big enough on its own, but putting them together felt like trying to force two pieces of a puzzle together while your laptop begged for mercy.

Pandas took so long that I started timing joins against my microwave popcorn.

Spoiler: the popcorn won every time.

Polars and DuckDB bailed me out.

In pandas:

merged_pd = df_pd.merge(pop_df_pd, on="country", how="left")

In Polars:

merged_pl = df_pl.join(pop_df_pl, on="country", how="left")

In DuckDB:

merged_duck = con.execute("""
    SELECT *
    FROM 'sales.csv' s
    LEFT JOIN 'pop.csv' p
    USING (country)
""").df()

Joins over large datasets that used to freeze your workflow now run smoothly and efficiently.

Lazy evaluation in Polars

One thing I didn't appreciate early in my data science journey was how much time was spent running transformations line by line.

Polars approaches this differently.

It uses a process called lazy evaluation, which waits until you've finished defining your transformations before doing any work.

It then examines the entire pipeline, determines the most efficient path, and executes it all at once.

It's like having a waiter who listens to your whole order before heading to the kitchen, instead of one who runs back after every item.

This TDS article explains lazy evaluation in more depth.

Here's what the flow looks like:

Pandas:

df = df[df["amount"] > 100]
df = df.groupby("segment").agg({"amount": "mean"})
df = df.sort_values("amount")

Polars lazy mode:

import polars as pl

df_lazy = (
    pl.scan_csv("sales.csv")
      .filter(pl.col("amount") > 100)
      .group_by("segment")
      .agg(pl.col("amount").mean())
      .sort("amount")
)

result = df_lazy.collect()

The first time I used lazy mode, it felt odd not to see immediate results. But when I called the final .collect(), the difference in speed was obvious.

Lazy evaluation won't magically solve every problem, but it brings a level of efficiency that pandas isn't designed for.


Conclusion and takeaways

Working with big data doesn't have to feel like fighting your tools.

Using Polars and DuckDB showed me that the problem wasn't always the data. Sometimes, it was the tool I was carrying.

If there's one thing you take away from this tutorial, let it be this: you don't have to abandon pandas, but you can reach for something better when your datasets start pushing it past its limits.

Polars gives you speed and efficient execution, while DuckDB lets you query large files like never before. Together, they make working with big data feel more manageable and less stressful.

If you want to go deeper into the ideas explored in this tutorial, the official documentation for Polars and DuckDB are good places to start.
