How to Handle Large Datasets in Python Even if You're a Beginner


# Introduction
Working with large datasets in Python often leads to a common problem: you load your data with Pandas, and your system slows down or crashes completely. This usually happens because you are trying to load everything into memory at once.
Most memory problems stem from how you load and process data. With a few effective techniques, you can handle datasets that are much larger than your available memory.
In this article, you'll learn seven techniques for working with large datasets effectively in Python. We'll start simple and build, so by the end, you'll know exactly which method fits your use case.
🔗 You can get the code on GitHub. If you like, you can run the sample data generator Python script to create sample CSV files, then use the code snippets to process them.
# 1. Read Data in Chunks
The simplest technique for beginners is to process your data in small chunks instead of loading everything at once.
Consider a situation where you have a large sales dataset and want to compute the total revenue. The following code shows this approach:
```python
import pandas as pd

# Define chunk size (number of rows per chunk)
chunk_size = 100000
total_revenue = 0

# Read and process the file in chunks
for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    # Process each chunk
    total_revenue += chunk['revenue'].sum()

print(f"Total Revenue: ${total_revenue:,.2f}")
```
Instead of loading all 10 million rows at once, we load 100,000 rows at a time. We calculate the sum for each chunk and add it to our running total. Your RAM only ever holds 100,000 rows, no matter how big the file is.
When to use this: when you need to perform aggregations (sum, count, average) or filter operations on large files.
# 2. Load Only the Columns You Need
In general, you don't need every column in your dataset. Loading only what you need can reduce memory usage significantly.
Let's say you are analyzing customer data, but you only need the age and purchase amount, while the file contains many other columns:
```python
import pandas as pd

# Only load the columns you actually need
columns_to_use = ['customer_id', 'age', 'purchase_amount']
df = pd.read_csv('customers.csv', usecols=columns_to_use)

# Now work with a much lighter dataframe
average_purchase = df.groupby('age')['purchase_amount'].mean()
print(average_purchase)
```
With usecols, Pandas only loads those three columns into memory. If your original file had 50 columns, you just cut your memory usage by roughly 94%.
When to use this: when you know exactly which columns you need before loading the data.
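If you aren't sure which columns a file contains, you can inspect the header without paying for a full load by passing nrows=0. The sketch below is self-contained: it uses a small in-memory CSV (a hypothetical customers table) in place of a real file path.

```python
import io

import pandas as pd

# In practice you'd pass a file path; here an in-memory CSV stands in
# for a large customers file so the example is self-contained.
csv_data = io.StringIO(
    "customer_id,age,purchase_amount,city,signup_date\n"
    "1,34,19.99,Oslo,2024-01-02\n"
    "2,28,5.49,Pune,2024-03-15\n"
)

# nrows=0 parses only the header row, so even a multi-gigabyte
# file costs almost nothing to inspect this way.
header = pd.read_csv(csv_data, nrows=0)
print(list(header.columns))
# → ['customer_id', 'age', 'purchase_amount', 'city', 'signup_date']
```

Once you see the column names, you can build the usecols list for the real load.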
# 3. Optimize Data Types
By default, Pandas may use more memory than necessary. A column of small integers may be stored as 64-bit values where 8-bit would work fine.
For example, if you load a dataset with product ratings (1-5 stars) and user IDs:
```python
import pandas as pd

# First, let's see the default memory usage
df = pd.read_csv('ratings.csv')
print("Default memory usage:")
print(df.memory_usage(deep=True))

# Now optimize the data types
df['rating'] = df['rating'].astype('int8')     # Ratings are 1-5, so int8 is enough
df['user_id'] = df['user_id'].astype('int32')  # Assuming user IDs fit in int32

print("\nOptimized memory usage:")
print(df.memory_usage(deep=True))
```
By changing the rating column from the default int64 (8 bytes per value) to int8 (1 byte per value), we get an 8x memory reduction for that column.
Common conversions include:
- int64 → int8, int16, or int32 (depending on the number range)
- float64 → float32 (if you don't need extreme precision)
- object → category (for columns with repeated values)
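If you don't want to pick each target type by hand, pd.to_numeric can downcast a column to the smallest type that fits its actual value range. A minimal sketch with made-up ratings data:

```python
import pandas as pd

# Hypothetical ratings data; pandas typically stores these as int64
df = pd.DataFrame({
    'rating': [1, 5, 3, 4, 2],
    'user_id': [1001, 1002, 1003, 1004, 1005],
})

# downcast='integer' picks the smallest integer type that fits the values:
# ratings 1-5 fit in int8, user IDs up to 1005 fit in int16
df['rating'] = pd.to_numeric(df['rating'], downcast='integer')
df['user_id'] = pd.to_numeric(df['user_id'], downcast='integer')
print(df.dtypes)
```

Note that the downcast is based on the values currently in the column, so if later data can be larger (new user IDs, say), choose the type explicitly instead.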
# 4. Use Categorical Data Types
If a column contains repeated text values (such as country names or product categories), Pandas stores each value separately as a string. The category dtype stores the unique values once and uses integer codes to refer to them.
Let's say you're working with a product inventory file where the category column has only 20 unique values, but it repeats for every row in the dataset:
```python
import pandas as pd

df = pd.read_csv('products.csv')

# Check memory before conversion
print(f"Before: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# Convert to category
df['category'] = df['category'].astype('category')

# Check memory after conversion
print(f"After: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# It still works like normal text
print(df['category'].value_counts())
```
This conversion can significantly reduce memory usage for columns with low cardinality (few unique values). Such columns still work largely the same way as normal text data: you can filter, group, and sort as usual.
When to use this: for any text column where values repeat (categories, regions, countries, departments, and the like).
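You can also declare the category dtype up front via the dtype parameter of read_csv, so the full string column never materializes in memory at all. A self-contained sketch (the inventory data here is made up and stands in for a large products.csv):

```python
import io

import pandas as pd

# Stand-in for a large products file with a low-cardinality column
csv_data = io.StringIO(
    "product_id,category\n"
    "1,Electronics\n"
    "2,Books\n"
    "3,Electronics\n"
    "4,Books\n"
)

# dtype={'category': 'category'} converts during parsing, so the
# column arrives in memory already encoded as integer codes
df = pd.read_csv(csv_data, dtype={'category': 'category'})
print(df['category'].dtype)           # category
print(df['category'].cat.categories)  # the unique values, stored once
```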
# 5. Filter While You Read
Sometimes you know you only need a subset of the rows. Instead of loading everything and then filtering, you can filter during the loading process.
For example, if you only care about transactions from the year 2024:
```python
import pandas as pd

# Read in chunks and filter
chunk_size = 100000
filtered_chunks = []

for chunk in pd.read_csv('transactions.csv', chunksize=chunk_size):
    # Filter each chunk before storing it
    filtered = chunk[chunk['year'] == 2024]
    filtered_chunks.append(filtered)

# Combine the filtered chunks
df_2024 = pd.concat(filtered_chunks, ignore_index=True)
print(f"Loaded {len(df_2024)} rows from 2024")
```
Here we combine chunking and filtering. Each chunk is filtered before being added to our list, so we never hold the full dataset in memory, only the rows we want.
When to use this: when you only need a subset of rows based on a specific condition.
# 6. Use Dask for Parallel Processing
For really large datasets, Dask provides an API similar to Pandas but handles chunking and parallelism automatically.
Here's how you can calculate a column average in a large dataset:
```python
import dask.dataframe as dd

# Read with Dask (it handles chunking automatically)
df = dd.read_csv('huge_dataset.csv')

# Operations look just like pandas
result = df['sales'].mean()

# Dask is lazy - compute() actually executes the calculation
average_sales = result.compute()
print(f"Average Sales: ${average_sales:,.2f}")
```
Dask does not load the entire file into memory. Instead, it builds a plan for processing the data in chunks and executes that plan when you call .compute(). It can also use multiple CPU cores to speed up calculations.
When to use this: when your dataset is too big for Pandas even with chunking, or when you want parallel processing without writing complex code.
# 7. Sample Your Data for Testing
If you're just prototyping or testing code, you don't need the full dataset. Load a sample first.
Let's say you're building a machine learning model and want to test your pre-processing pipeline. You can sample your dataset as shown:
```python
import random

import pandas as pd

# Read just the first 50,000 rows
df_sample = pd.read_csv('huge_dataset.csv', nrows=50000)

# Or read a random sample using skiprows
skip_rows = lambda x: x > 0 and random.random() > 0.01  # Keep ~1% of rows
df_random_sample = pd.read_csv('huge_dataset.csv', skiprows=skip_rows)
print(f"Sample size: {len(df_random_sample)} rows")
```
The first method loads the first N rows, which is good for quick tests. The second randomly samples rows from the entire file, which is better for statistical analysis or when the file is sorted in a way that makes the top rows unrepresentative.
When to use this: during development, debugging, or exploratory analysis before running your code on the full dataset.
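One caveat with the random-sample approach: each run keeps a different set of rows. Seeding a dedicated random.Random instance makes the sample reproducible. A self-contained sketch, using an in-memory CSV of 1,000 made-up rows in place of the large file:

```python
import io
import random

import pandas as pd

# Stand-in for a large CSV: a header plus 1,000 data rows
csv_text = "value\n" + "\n".join(str(i) for i in range(1000))

# A seeded generator produces the same skip decisions every run,
# so the same ~10% sample is selected each time
rng = random.Random(42)
skip_rows = lambda x: x > 0 and rng.random() > 0.10

df_sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip_rows)
print(f"Sample size: {len(df_sample)} rows")
```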
# Conclusion
Managing large datasets does not require expert-level skills. Here's a quick summary of the strategies we discussed:
| Technique | When to use it |
|---|---|
| Chunking | Aggregating, filtering, and processing data that doesn't fit in RAM. |
| Column selection | When you need only a few columns from a wide dataset. |
| Optimizing data types | Almost always; do this after loading to save memory. |
| Category dtype | For text columns with repeated values (categories, regions, etc.). |
| Filter while you read | When you only need a subset of the rows. |
| Dask | For very large datasets or when you want parallel processing. |
| Sampling | During development and testing. |
The first step is to know both your data and your workload. Most of the time, a combination of column selection and the right data types will get you 90% of the way there.
As your needs grow, move to more advanced tools like Dask or consider converting your data to efficient file formats like Parquet or HDF5.
Now go ahead and start working with those big datasets. Happy analyzing!
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.



