
5 Easy Pandas Alternatives You Should Try


Getting Started

Developers use pandas for data manipulation, but it can be slow, especially on large datasets. Because of this, many look for faster and lighter alternatives. These options keep the important features needed for analysis while focusing on speed, low memory usage, and simplicity. In this article, we take a look at five pandas alternatives you can try.

1. DuckDB

DuckDB is like SQLite for analytics. You can run SQL queries directly on comma-separated values (CSV) files. It helps if you already know SQL or work with machine learning pipelines. Install it with pip:

pip install duckdb

We'll use the Titanic dataset and run a simple SQL query on it like this:

import duckdb

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"  # URL truncated in the original; seaborn's Titanic copy assumed

# Run SQL query on the CSV
result = duckdb.query(f"""
    SELECT sex, age, survived
    FROM read_csv_auto('{url}')
    WHERE age > 18
""").to_df()

print(result.head())

Output:


      sex     age   survived
0     male    22.0          0
1   female    38.0          1
2   female    26.0          1
3   female    35.0          1
4     male    35.0          0

DuckDB runs the SQL query directly on the CSV file and converts the output to a DataFrame. You get the speed of SQL with the flexibility of Python.

2. Polars

Polars is one of the most popular DataFrame libraries available today. It is written in Rust and is exceptionally fast with minimal memory requirements. The syntax is clean, too. Let's install it using pip:

pip install polars

Now, let's use the Titanic dataset to cover a simple example:

import polars as pl

# Load dataset 
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"  # URL truncated in the original; seaborn's Titanic copy assumed
df = pl.read_csv(url)

result = df.filter(pl.col("age") > 40).select(["sex", "age", "survived"])
print(result)

Output:


shape: (150, 3)
┌────────┬──────┬──────────┐
│ sex    ┆ age  ┆ survived │
│ ---    ┆ ---  ┆ ---      │
│ str    ┆ f64  ┆ i64      │
╞════════╪══════╪══════════╡
│ male   ┆ 54.0 ┆ 0        │
│ female ┆ 58.0 ┆ 1        │
│ female ┆ 55.0 ┆ 1        │
│ male   ┆ 66.0 ┆ 0        │
│ male   ┆ 42.0 ┆ 0        │
│ …      ┆ …    ┆ …        │
│ female ┆ 48.0 ┆ 1        │
│ female ┆ 42.0 ┆ 1        │
│ female ┆ 47.0 ┆ 1        │
│ male   ┆ 47.0 ┆ 0        │
│ female ┆ 56.0 ┆ 1        │
└────────┴──────┴──────────┘

Polars reads the CSV, filters rows on the age condition, and selects a subset of columns.

3. PyArrow

PyArrow is a lightweight library for columnar data. Tools like Polars use Apache Arrow under the hood for its speed and memory efficiency. It's not a full pandas replacement, but it's good enough for reading files and basic work. Install it with pip:

pip install pyarrow

For our example, let's use the Iris dataset in CSV form as follows:

import pyarrow.csv as csv
import pyarrow.compute as pc
import urllib.request

# Download the Iris CSV 
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"  # URL truncated in the original; seaborn's Iris copy assumed
local_file = "iris.csv"
urllib.request.urlretrieve(url, local_file)

# Read with PyArrow
table = csv.read_csv(local_file)

# Filter rows
filtered = table.filter(pc.greater(table['sepal_length'], 5.0))

print(filtered.slice(0, 5))

Output:


pyarrow.Table
sepal_length: double
sepal_width: double
petal_length: double
petal_width: double
species: string
----
sepal_length: [[5.1,5.4,5.4,5.8,5.7]]
sepal_width: [[3.5,3.9,3.7,4,4.4]]
petal_length: [[1.4,1.7,1.5,1.2,1.5]]
petal_width: [[0.2,0.4,0.2,0.2,0.4]]
species: [["setosa","setosa","setosa","setosa","setosa"]]

PyArrow reads the CSV into a columnar format. Each column's name and type appears in an explicit schema. This layout makes scanning and filtering large datasets faster.

4. Modin

Modin is for anyone who wants a speedup without having to learn a new library. It uses the same pandas API but parallelizes the work behind the scenes. You don't need to change your existing code; just swap the import. Everything else works like regular pandas. Install it with pip:

pip install "modin[all]"

For a better understanding, let's try a small example using the same Titanic dataset as follows:

import modin.pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"  # URL truncated in the original; seaborn's Titanic copy assumed

# Load the dataset
df = pd.read_csv(url)

# Filter the dataset 
adults = df[df["age"] > 18]

# Select only a few columns to display
adults_small = adults[["survived", "sex", "age", "class"]]

# Display result
adults_small.head()

Output:


   survived     sex   age   class
0         0    male  22.0   Third
1         1  female  38.0   First
2         1  female  26.0   Third
3         1  female  35.0   First
4         0    male  35.0   Third

Modin spreads the work across all CPU cores, meaning you get better performance without doing anything extra.

5. Dask

How do you handle big data without adding more RAM? Dask is a good choice when your files are larger than your computer's random-access memory (RAM). It uses lazy evaluation, so it doesn't load all the data into memory at once. This helps you process millions of rows efficiently. Install it with pip:

pip install dask[complete]

To try, we can use the Chicago Crime Dataset, like this:

import dask.dataframe as dd
import urllib.request

url = ""  # URL truncated in the original; point this at a CSV export of the Chicago Crime dataset
local_file = "chicago_crime.csv"
urllib.request.urlretrieve(url, local_file)

# Read CSV with Dask (lazy evaluation)
df = dd.read_csv(local_file, dtype=str)  # all columns as string

# Filter crimes classified as 'THEFT'
thefts = df[df['Primary Type'] == 'THEFT']

# Select a few relevant columns
thefts_small = thefts[["ID", "Date", "Primary Type", "Description", "District"]]

print(thefts_small.head())

Output:


          ID                   Date Primary Type       Description District            
5   13204489 09/06/2023 11:00:00 AM        THEFT         OVER $500      001
50  13179181 08/17/2023 03:15:00 PM        THEFT      RETAIL THEFT      014
51  13179344 08/17/2023 07:25:00 PM        THEFT      RETAIL THEFT      014
53  13181885 08/20/2023 06:00:00 AM        THEFT    $500 AND UNDER      025
56  13184491 08/22/2023 11:44:00 AM        THEFT      RETAIL THEFT      014

Filtering (Primary Type == 'THEFT') and selecting columns are lazy operations. The calls return instantly because Dask processes data in chunks rather than loading everything at once.

Wrapping Up

We covered five alternatives to pandas and how to use them. Each one keeps things simple and focused. See each library's official documentation for complete details.

If you run into any issues, leave a comment and I'll help.

Kanwal Mehreen is a machine learning engineer and technical writer with a strong interest in data science and the intersection of AI and medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM.

