Working with Billion-Row Datasets in Python (Using Vaex)


Photo by the Author

# Introduction

Managing large datasets containing billions of rows is a major challenge in data science and analytics. Traditional tools like Pandas work well for small to medium datasets that fit in system memory, but as dataset sizes grow they become slow, consume large amounts of random access memory (RAM), and often crash with out-of-memory (OOM) errors.

This is where Vaex, a high-performance Python library for large-scale data processing, comes in. Vaex lets you explore, transform, visualize, and analyze huge tabular datasets efficiently and easily, even on a standard laptop.

# What is Vaex?

Vaex is a lazy, out-of-core Python DataFrame library (similar to Pandas) designed for datasets larger than your RAM.

Important features:

Vaex handles large datasets efficiently by working directly with the data on disk and reading only the parts it needs, avoiding loading entire files into memory.

Vaex uses lazy evaluation, meaning computations run only when results are requested, and it can open columnar file formats – which store data by column instead of by row – such as HDF5, Apache Arrow, and Parquet on the fly via memory mapping.

Built on optimized C/C++ backends, Vaex can evaluate expressions and perform operations at billions of rows per second, making large-scale analysis fast even on modest hardware.

It has an application programming interface (API) similar to Pandas, which makes the transition smooth for users already familiar with Pandas and lets them tap big-data capabilities without a steep learning curve.

# Comparing Vaex and Dask

Vaex is not similar to Dask as a whole, but it is similar to Dask DataFrames, which are built on top of Pandas DataFrames. This means Dask inherits some of Pandas' issues, such as requiring data to be fully loaded into RAM for certain operations. Vaex does not: it avoids making copies of the DataFrame, so it can process larger DataFrames on machines with very little memory. Both Vaex and Dask use lazy evaluation. The main difference is that Vaex computes a field only when it is required, whereas with Dask we need to call compute() explicitly. Data needs to be in HDF5 or Apache Arrow format to take full advantage of Vaex.
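
To make the difference concrete, here is a minimal sketch, assuming a Parquet file data.parquet with a numeric column x (both names are placeholders):

import vaex
import dask.dataframe as dd

# Vaex: the mean is evaluated as soon as the result is requested
df_v = vaex.open('data.parquet')
print(df_v.x.mean())

# Dask: mean() only builds a task graph; compute() triggers execution
df_d = dd.read_parquet('data.parquet')
print(df_d['x'].mean().compute())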

# Why Traditional Tools Struggle

Tools like Pandas load the entire dataset into RAM before processing. For datasets larger than memory, this leads to:

  • Slow operations
  • System crashes (OOM errors)
  • Limited scalability

Vaex does not load the entire dataset into memory; instead, it:

  • Streams data from disk
  • Uses virtual columns and lazy evaluation to defer computation
  • Performs computations only when results are explicitly needed

This allows analysis of large datasets even on modest hardware.

# How Vaex Works Under the Hood

// Out-of-Core Execution

Vaex reads data from disk as needed using memory mapping. This lets it work on data files far larger than the available RAM.
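
The underlying idea can be illustrated with NumPy's memory mapping (a conceptual sketch only, not Vaex internals; big_array.bin is a placeholder for an existing file):

import numpy as np

# A memory-mapped array reads pages from disk on access
# instead of loading the whole file into RAM up front.
arr = np.memmap('big_array.bin', dtype=np.float64, mode='r', shape=(1_000_000_000,))
print(arr[:5])  # only the touched pages are actually read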

// Lazy Evaluation

Instead of executing each operation immediately, Vaex builds a computation graph. Calculations are performed only when you request a result (e.g., when printing or plotting).
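
A minimal sketch using Vaex's built-in sample dataset (vaex.example() downloads a small sample dataset on first use):

import vaex

df = vaex.example()
expr = df.x + df.y   # builds an expression; nothing is computed yet
print(repr(expr))    # shows the expression, not the values
print(expr.sum())    # evaluation happens here, in one pass over the data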

// Virtual Columns

Virtual columns are expressions defined on the dataset that are not materialized in memory until they are computed. This saves RAM and speeds up the workflow.
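
For example, again with the built-in sample dataset:

import vaex

df = vaex.example()
df['r'] = (df.x**2 + df.y**2)**0.5  # virtual column: stored as an expression, not an array
print(df.r.mean())                  # values are computed on the fly when requested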

# Getting Started With Vaex

// Installing Vaex

Create a clean virtual environment:

conda create -n vaex_demo python=3.9
conda activate vaex_demo

Install Vaex with pip:

pip install vaex-core vaex-hdf5 vaex-viz

Upgrade Vaex:

pip install --upgrade vaex

Install supporting libraries:

pip install pandas numpy matplotlib

// Opening Large Datasets

Vaex supports popular storage formats for managing large datasets. It can work directly with HDF5, Apache Arrow, and Parquet files, all of which are optimized for efficient disk access and fast analysis. Vaex can also read CSV files, but it is best to convert them to a more efficient format first to improve performance on large datasets.
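
A one-time conversion can be done with vaex.from_csv; a sketch, where big_data.csv is a placeholder filename. With convert=True, Vaex reads the CSV in chunks, writes an HDF5 copy next to it, and memory-maps the result:

import vaex

# Reads the CSV in chunks and writes big_data.csv.hdf5 alongside it;
# subsequent opens memory-map the HDF5 file almost instantly.
df = vaex.from_csv('big_data.csv', convert=True, chunk_size=5_000_000)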

How to open a Parquet file:

import vaex

df = vaex.open("your_huge_dataset.parquet")
print(df)

Now you can inspect the structure of the dataset without loading it into memory.
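
For example, these inspection calls read only what they need from disk:

print(df.head(5))     # first rows, read on demand
print(df.describe())  # summary statistics computed in streaming passes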

// Key Operations in Vaex

Filter data:

filtered = df[df.sales > 1000]

This does not calculate the result immediately; instead, the filter is registered and used only when needed.

Grouping and aggregation:

result = df.groupby("category", agg=vaex.agg.mean("sales"))
print(result)

Vaex computes aggregations efficiently using parallel algorithms and minimal memory.

Computing statistics:

mean_price = df["price"].mean()
print(mean_price)

Vaex computes this out of core by scanning the dataset in chunks.
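
Statistics can also be computed over a subset without materializing a filtered copy, using Vaex selections (column names as assumed above):

# Mean price for high-sales rows, computed in one streaming pass
mean_high = df.mean(df.price, selection=df.sales > 1000)
print(mean_high)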

// Demonstration with a Taxi Dataset

We will create a synthetic dataset of 50 million taxi trips to demonstrate Vaex's capabilities:

import vaex
import numpy as np
import pandas as pd
import time

Set a random seed for reproducibility:

np.random.seed(42)
print("Creating 50 million row dataset...")
n = 50_000_000

Generate synthetic taxi trip data:

data = {
    'passenger_count': np.random.randint(1, 7, n),
    'trip_distance': np.random.exponential(3, n),
    'fare_amount': np.random.gamma(10, 1.5, n),
    'tip_amount': np.random.gamma(2, 1, n),
    'total_amount': np.random.gamma(12, 1.8, n),
    'payment_type': np.random.choice(['credit', 'cash', 'mobile'], n),
    'pickup_hour': np.random.randint(0, 24, n),
    'pickup_day': np.random.randint(1, 8, n),
}

Create Vaex DataFrame:

df_vaex = vaex.from_dict(data)

Export to HDF5 format (which works well with Vaex):

df_vaex.export_hdf5('taxi_50M.hdf5')
print(f"Created dataset with {n:,} rows")

Output:

Shape: (50000000, 8)
Created dataset with 50,000,000 rows

We now have a data set of 50 million rows with 8 columns.

// Vaex vs. Pandas Performance

Opening large files with Vaex via memory mapping:

start = time.time()
df_vaex = vaex.open('taxi_50M.hdf5')
vaex_time = time.time() - start

print(f"Vaex opened {df_vaex.shape[0]:,} rows in {vaex_time:.4f} seconds")
print(f"Memory usage: ~0 MB (memory-mapped)")

Output:

Vaex opened 50,000,000 rows in 0.0199 seconds
Memory usage: ~0 MB (memory-mapped)

Pandas: load into memory (don't try this with 50M rows!):

# This would fail on most machines
df_pandas = pd.read_hdf('taxi_50M.hdf5')

This will raise a memory error! Vaex, by contrast, opens files almost instantly regardless of size, because it does not load the data into memory.
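
A rough back-of-the-envelope estimate shows why (a sketch; 8 bytes per float64 value, ignoring the string column, which would cost even more):

# 50M rows x 7 numeric columns x 8 bytes each
approx_bytes = 50_000_000 * 7 * 8
print(f"~{approx_bytes / 1e9:.1f} GB of RAM for the numeric columns alone")  # ~2.8 GB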

Basic aggregations: compute statistics on 50 million rows:

start = time.time()
stats = {
    'mean_fare': df_vaex.fare_amount.mean(),
    'mean_distance': df_vaex.trip_distance.mean(),
    'total_revenue': df_vaex.total_amount.sum(),
    'max_fare': df_vaex.fare_amount.max(),
    'min_fare': df_vaex.fare_amount.min(),
}
agg_time = time.time() - start

print(f"nComputed 5 aggregations in {agg_time:.4f} seconds:")
print(f"  Mean fare: ${stats['mean_fare']:.2f}")
print(f"  Mean distance: {stats['mean_distance']:.2f} miles")
print(f"  Total revenue: ${stats['total_revenue']:,.2f}")
print(f"  Fare range: ${stats['min_fare']:.2f} - ${stats['max_fare']:.2f}")

Output:

Computed 5 aggregations in 0.8771 seconds:
  Mean fare: $15.00
  Mean distance: 3.00 miles
  Total revenue: $1,080,035,827.27
  Fare range: $1.25 - $55.30

Filter operations: filter for long trips:

start = time.time()
long_trips = df_vaex[df_vaex.trip_distance > 10]
filter_time = time.time() - start

print(f"nFiltered for trips > 10 miles in {filter_time:.4f} seconds")
print(f"  Found: {len(long_trips):,} long trips")
print(f"  Percentage: {(len(long_trips)/len(df_vaex)*100):.2f}%")

Output:

Filtered for trips > 10 miles in 0.0486 seconds
Found: 1,784,122 long trips
Percentage: 3.57%

Multiple conditions:

start = time.time()
premium_trips = df_vaex[(df_vaex.trip_distance > 5) & 
                        (df_vaex.fare_amount > 20) & 
                        (df_vaex.payment_type == 'credit')]
multi_filter_time = time.time() - start

print(f"nMultiple condition filter in {multi_filter_time:.4f} seconds")
print(f"  Premium trips (>5mi, >$20, credit): {len(premium_trips):,}")

Output:

Multiple condition filter in 0.0582 seconds
Premium trips (>5mi, >$20, credit): 457,191

GroupBy operations:

start = time.time()
by_payment = df_vaex.groupby('payment_type', agg={
    'mean_fare': vaex.agg.mean('fare_amount'),
    'mean_tip': vaex.agg.mean('tip_amount'),
    'total_trips': vaex.agg.count(),
    'total_revenue': vaex.agg.sum('total_amount')
})
groupby_time = time.time() - start

print(f"nGroupBy operation in {groupby_time:.4f} seconds")
print(by_payment.to_pandas_df())

Output:

GroupBy operation in 5.6362 seconds
  payment_type  mean_fare  mean_tip  total_trips  total_revenue
0       credit  15.001817  2.000065     16663623   3.599456e+08
1       mobile  15.001200  1.999679     16667691   3.600165e+08
2         cash  14.999397  2.000115     16668686   3.600737e+08

A more complex groupby, by pickup hour:

start = time.time()
by_hour = df_vaex.groupby('pickup_hour', agg={
    'avg_distance': vaex.agg.mean('trip_distance'),
    'avg_fare': vaex.agg.mean('fare_amount'),
    'trip_count': vaex.agg.count()
})
complex_groupby_time = time.time() - start

print(f"nGroupBy by hour in {complex_groupby_time:.4f} seconds")
print(by_hour.to_pandas_df().head(10))

Output:

GroupBy by hour in 1.6910 seconds
   pickup_hour  avg_distance   avg_fare  trip_count
0            0      2.998120  14.997462     2083481
1            1      3.000969  14.998814     2084650
2            2      3.003834  15.001777     2081962
3            3      3.001263  14.998196     2081715
4            4      2.998343  14.999593     2083882
5            5      2.997586  15.003988     2083421
6            6      2.999887  15.011615     2083213
7            7      3.000240  14.996892     2085156
8            8      3.002640  15.000326     2082704
9            9      2.999857  14.997857     2082284

// Advanced Vaex Features

Virtual columns (computed columns) allow adding columns without copying data:

df_vaex['tip_percentage'] = (df_vaex.tip_amount / df_vaex.fare_amount) * 100
df_vaex['is_generous_tipper'] = df_vaex.tip_percentage > 20
df_vaex['rush_hour'] = ((df_vaex.pickup_hour >= 7) & (df_vaex.pickup_hour <= 9)) | \
                       ((df_vaex.pickup_hour >= 17) & (df_vaex.pickup_hour <= 19))

These are computed on the fly with no extra memory:

print("Added 3 virtual columns with zero memory overhead")
generous_tippers = df_vaex[df_vaex.is_generous_tipper]
print(f"Generous tippers (>20% tip): {len(generous_tippers):,}")

rush_hour_trips = df_vaex[df_vaex.rush_hour]
print(f"Rush hour trips: {len(rush_hour_trips):,}")

Output:

VIRTUAL COLUMNS
Added 3 virtual columns with zero memory overhead
Generous tippers (>20% tip): 11,997,433
Rush hour trips: 12,498,848

Correlation analysis:

corr = df_vaex.correlation(df_vaex.trip_distance, df_vaex.fare_amount)
print(f"Correlation (distance vs fare): {corr:.4f}")

Percentiles:

try:
    percentiles = df_vaex.percentile_approx('fare_amount', [25, 50, 75, 90, 95, 99])
except AttributeError:
    percentiles = [
        df_vaex.fare_amount.quantile(0.25),
        df_vaex.fare_amount.quantile(0.50),
        df_vaex.fare_amount.quantile(0.75),
        df_vaex.fare_amount.quantile(0.90),
        df_vaex.fare_amount.quantile(0.95),
        df_vaex.fare_amount.quantile(0.99),
    ]

print(f"nFare percentiles:")
print(f"25th: ${percentiles[0]:.2f}")
print(f"50th (median): ${percentiles[1]:.2f}")
print(f"75th: ${percentiles[2]:.2f}")
print(f"90th: ${percentiles[3]:.2f}")
print(f"95th: ${percentiles[4]:.2f}")
print(f"99th: ${percentiles[5]:.2f}")

Standard deviation:

std_fare = df_vaex.fare_amount.std()
print(f"nStandard deviation of fares: ${std_fare:.2f}")

Additional statistics:

print(f"nAdditional statistics:")
print(f"Mean: ${df_vaex.fare_amount.mean():.2f}")
print(f"Min: ${df_vaex.fare_amount.min():.2f}")
print(f"Max: ${df_vaex.fare_amount.max():.2f}")

Output:

Correlation (distance vs fare): -0.0001

Fare percentiles:
  25th: $11.57
  50th (median): $nan
  75th: $nan
  90th: $nan
  95th: $nan
  99th: $nan

Standard deviation of fares: $4.74

Additional statistics:
  Mean: $15.00
  Min: $1.25
  Max: $55.30

// Exporting Data

# Export filtered data
high_value_trips = df_vaex[df_vaex.total_amount > 50]

Export to HDF5:

start = time.time()
high_value_trips.export_hdf5('high_value_trips.hdf5')
export_time = time.time() - start
print(f"Exported {len(high_value_trips):,} rows to HDF5 in {export_time:.4f}s")

You can also export to CSV, Parquet, etc.:

high_value_trips.export_csv('high_value_trips.csv')
high_value_trips.export_parquet('high_value_trips.parquet')

Output:

Exported 13,054 rows to HDF5 in 5.4508s

Performance summary dashboard:

print("VAEX PERFORMANCE SUMMARY")
print(f"Dataset size:           {n:,} rows")
print(f"File size on disk:      ~2.4 GB")
print(f"RAM usage:              ~0 MB (memory-mapped)")
print()
print(f"Open time:              {vaex_time:.4f} seconds")
print(f"Single aggregation:     {agg_time:.4f} seconds")
print(f"Simple filter:          {filter_time:.4f} seconds")
print(f"Complex filter:         {multi_filter_time:.4f} seconds")
print(f"GroupBy operation:      {groupby_time:.4f} seconds")
print()
print(f"Throughput:             ~{n/groupby_time:,.0f} rows/second")

Output:

VAEX PERFORMANCE SUMMARY
Dataset size:           50,000,000 rows
File size on disk:      ~2.4 GB
RAM usage:              ~0 MB (memory-mapped)

Open time:              0.0199 seconds
Single aggregation:     0.8771 seconds
Simple filter:          0.0486 seconds
Complex filter:         0.0582 seconds
GroupBy operation:      5.6362 seconds

Throughput:             ~8,871,262 rows/second

# Concluding Thoughts

Vaex is ideal when you are working with datasets larger than about 1 GB that do not fit in RAM, analyzing big data, engineering features across millions of rows, or building data processing pipelines.

You probably should not use Vaex on datasets smaller than about 100 MB; in that case, Pandas is the easier choice. If you are dealing with complex joins across multiple tables, a Structured Query Language (SQL) database may be a better fit. If you need the full Pandas API, note that Vaex has limited compatibility. For real-time streaming data, other tools are more appropriate.

Vaex fills a gap in the Python data science ecosystem: the ability to work on billion-row datasets efficiently and interactively without loading everything into memory. Its out-of-core architecture, lazy evaluation model, and optimized algorithms make it a powerful tool for big-data analysis even on a laptop. Whether you're examining large logs, scientific surveys, or high-frequency time series, Vaex helps bridge the gap between ease of use and big-data scalability.

Shittu is a software engineer and technical writer who enjoys using cutting-edge technology to tell interesting stories, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.
