5 Easy Pandas Alternatives You Should Try


Photo by the Author
Getting Started
Developers use pandas for data manipulation, but it can be slow, especially with large datasets. Because of this, many look for faster and lighter alternatives. These options preserve the important features needed for analysis while focusing on speed, low memory usage, and simplicity. In this article, we take a look at five pandas alternatives you can try.
1. DuckDB
DuckDB is like SQLite, but for analytics. You can run SQL queries directly on CSV files. It's a great fit if you already know SQL or work with machine learning pipelines. Install it with:
pip install duckdb
We'll use the Titanic dataset and run a simple SQL query on it like this:
import duckdb
url = "
# Run SQL query on the CSV
result = duckdb.query(f"""
SELECT sex, age, survived
FROM read_csv_auto('{url}')
WHERE age > 18
""").to_df()
print(result.head())
Output:
sex age survived
0 male 22.0 0
1 female 38.0 1
2 female 26.0 1
3 female 35.0 1
4 male 35.0 0
DuckDB runs the SQL query directly on the CSV file and converts the output to a DataFrame, so you get the speed of SQL with the flexibility of Python.
2. Polars
Polars is one of the most popular DataFrame libraries available today. It's written in Rust and is exceptionally fast with minimal memory requirements. The syntax is clean, too. Install it with pip:
pip install polars
Now, let's use the Titanic dataset to cover a simple example:
import polars as pl
# Load dataset
url = "
df = pl.read_csv(url)
result = df.filter(pl.col("age") > 40).select(["sex", "age", "survived"])
print(result)
Output:
shape: (150, 3)
┌────────┬──────┬──────────┐
│ sex ┆ age ┆ survived │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪══════╪══════════╡
│ male ┆ 54.0 ┆ 0 │
│ female ┆ 58.0 ┆ 1 │
│ female ┆ 55.0 ┆ 1 │
│ male ┆ 66.0 ┆ 0 │
│ male ┆ 42.0 ┆ 0 │
│ … ┆ … ┆ … │
│ female ┆ 48.0 ┆ 1 │
│ female ┆ 42.0 ┆ 1 │
│ female ┆ 47.0 ┆ 1 │
│ male ┆ 47.0 ┆ 0 │
│ female ┆ 56.0 ┆ 1 │
└────────┴──────┴──────────┘
Polars reads the CSV, filters rows by the age condition, and selects a subset of columns.
3. PyArrow
PyArrow is a lightweight library for columnar data. Tools like Polars build on Apache Arrow for its speed and memory efficiency. It's not a full pandas replacement, but it's good enough for reading files and basic operations. Install it with:
pip install pyarrow
For our example, let's use the Iris dataset in CSV form as follows:
import pyarrow.csv as csv
import pyarrow.compute as pc
import urllib.request
# Download the Iris CSV
url = "
local_file = "iris.csv"
urllib.request.urlretrieve(url, local_file)
# Read with PyArrow
table = csv.read_csv(local_file)
# Filter rows
filtered = table.filter(pc.greater(table['sepal_length'], 5.0))
print(filtered.slice(0, 5))
Output:
pyarrow.Table
sepal_length: double
sepal_width: double
petal_length: double
petal_width: double
species: string
----
sepal_length: [[5.1,5.4,5.4,5.8,5.7]]
sepal_width: [[3.5,3.9,3.7,4,4.4]]
petal_length: [[1.4,1.7,1.5,1.2,1.5]]
petal_width: [[0.2,0.4,0.2,0.2,0.4]]
species: [["setosa","setosa","setosa","setosa","setosa"]]
PyArrow reads the CSV and converts it to a columnar format. Each column's name and type is recorded in a clear schema. This layout makes it faster to scan and filter large datasets.
4. Modin
Modin is for anyone who wants a speedup without having to learn a new library. It uses the same pandas API but runs operations in parallel. You don't need to change your existing code; just swap the import. Everything else works like normal pandas. Install it with pip:
pip install "modin[all]"
For a better understanding, let's try a small example using the same Titanic dataset as follows:
import modin.pandas as pd
url = "
# Load the dataset
df = pd.read_csv(url)
# Filter the dataset
adults = df[df["age"] > 18]
# Select only a few columns to display
adults_small = adults[["survived", "sex", "age", "class"]]
# Display result
adults_small.head()
Output:
survived sex age class
0 0 male 22.0 Third
1 1 female 38.0 First
2 1 female 26.0 Third
3 1 female 35.0 First
4 0 male 35.0 Third
Modin spreads the work across all CPU cores, so you get better performance without doing anything extra.
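Because Modin mirrors the pandas API, a common pattern is to make the import itself the only switch. The sketch below (with made-up sample rows) falls back to plain pandas when Modin isn't installed; the code after the import is identical either way:

```python
try:
    import modin.pandas as pd  # parallel backend, same API as pandas
except ImportError:
    import pandas as pd  # plain pandas fallback; nothing below changes

df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 15.0],
    "survived": [0, 1, 1],
})

# Exactly the same filtering code works with either import
adults = df[df["age"] > 18][["sex", "age", "survived"]]
print(adults)
```

This one-line swap is the whole migration story for most scripts, which is Modin's main selling point.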
5. Dask
How do you handle big data without adding more RAM? Dask is a good choice when your files are larger than your computer's memory. It uses lazy evaluation, so it doesn't load all the data into memory at once. This helps you process millions of rows efficiently. Install it with:
pip install dask[complete]
To try it, we can use the Chicago Crime dataset, like this:
import dask.dataframe as dd
import urllib.request
url = "
local_file = "chicago_crime.csv"
urllib.request.urlretrieve(url, local_file)
# Read CSV with Dask (lazy evaluation)
df = dd.read_csv(local_file, dtype=str) # all columns as string
# Filter crimes classified as 'THEFT'
thefts = df[df['Primary Type'] == 'THEFT']
# Select a few relevant columns
thefts_small = thefts[["ID", "Date", "Primary Type", "Description", "District"]]
print(thefts_small.head())
Output:
ID Date Primary Type Description District
5 13204489 09/06/2023 11:00:00 AM THEFT OVER $500 001
50 13179181 08/17/2023 03:15:00 PM THEFT RETAIL THEFT 014
51 13179344 08/17/2023 07:25:00 PM THEFT RETAIL THEFT 014
53 13181885 08/20/2023 06:00:00 AM THEFT $500 AND UNDER 025
56 13184491 08/22/2023 11:44:00 AM THEFT RETAIL THEFT 014
Filtering (Primary Type == 'THEFT') and selecting columns are lazy operations: they return instantly because Dask processes the data in chunks rather than loading everything at once. The actual work only happens when a result is requested, such as by .head().
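The chunk-by-chunk idea behind Dask can be imitated in plain pandas with the chunksize argument, which also avoids holding a big CSV in memory at once. A minimal sketch on a small in-memory CSV (the rows are made up for illustration):

```python
import io
import pandas as pd

# A tiny CSV standing in for a large crime file
csv_data = io.StringIO(
    "ID,Primary Type\n"
    "1,THEFT\n"
    "2,BATTERY\n"
    "3,THEFT\n"
    "4,ASSAULT\n"
)

# Read two rows at a time, filter each chunk, then combine the pieces
chunks = pd.read_csv(csv_data, chunksize=2)
thefts = pd.concat(chunk[chunk["Primary Type"] == "THEFT"] for chunk in chunks)
print(len(thefts))
```

Dask automates this pattern (plus parallelism and a task graph), but the sketch shows why peak memory stays bounded by the chunk size rather than the file size.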
Wrapping Up
We covered five alternatives to pandas and how to use them. Each one keeps things simple and focused. See each library's official documentation for complete details.
If you run into any issues, leave a comment and I'll help.
Kanwal Mehreen is a machine learning engineer and technical writer with a strong interest in data science and the intersection of AI and medicine. She authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM.



