Machine Learning

Stop Writing Dirty Boolean Masks: 10 Best Ways to Filter Pandas DataFrames

In a previous article, I discussed how to create your first DataFrame using Pandas, and I mentioned that the first thing you need to understand well is data structures before moving on to data analysis with Python.

Pandas is an excellent library for data manipulation and retrieval. Combine it with NumPy and Seaborn, and you've got yourself some real data analysis power.

In this article, I will walk you through efficient ways to filter data in Pandas, starting with simple boolean masks and progressing to powerful methods such as .isin(), .str.startswith(), and .query(). By the end, you will have a toolkit of filtering techniques that you can apply to any dataset.

Without further ado, let's get into it!

Importing our data

Ok, to start, I'm going to import the pandas library.

# importing the pandas library
import pandas as pd

That is the only library I will need for this walkthrough.

Next, I will load the dataset. The dataset was generated with ChatGPT, btw, and contains basic sales records. Let's take a look at it.

# checking out our data
df_sales = pd.read_csv('sales_data.csv')
df_sales

Here is a preview of the data

It contains basic sales records with columns for OrderID, Customer, Product, Category, Quantity, Price, OrderDate, and Region.
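Since the image preview doesn't reproduce here, a minimal stand-in DataFrame with the same columns lets you follow along without the CSV file. The values below are made up for illustration; your sales_data.csv will differ.

```python
import pandas as pd

# A small made-up frame mirroring the columns of sales_data.csv
df_sales = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Customer": ["Alice", "Bob", "Charlie", "Diana"],
    "Product": ["Laptop", "Desk", "Phone", "Chair"],
    "Category": ["Electronics", "Furniture", "Electronics", "Furniture"],
    "Quantity": [1, 2, 3, 1],
    "Price": [1200.0, 650.0, 800.0, 150.0],
    "OrderDate": ["2023-01-05", "2023-01-09", "2023-01-12", "2023-01-15"],
    "Region": ["North", "East", "South", "West"],
})
print(df_sales.shape)  # (4, 8)
```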

Alright, let's start filtering!

Filtering by one criterion

Let's try to select all the records in a certain category. For example, I want to see all the orders made in the Electronics category. Doing that is very straightforward.

# Filter by a single condition
# Example: All orders from the "Electronics" category.
df_sales['Category'] == 'Electronics'

In Python, you need to distinguish between the = operator and the == operator.

= is used to assign a value to a variable.

For example

x = 10 # Assigns the value 10 to the variable x

==, on the other hand, is used to compare two values. For example

a = 3
b = 3
print(a == b) # Output: True

c = 5
d = 10
print(c == d) # Output: False

That said, let's apply the same idea to the filter I did above

# Filter by a single condition
# Example: All orders from the "Electronics" category.
df_sales['Category'] == 'Electronics'

Here, I'm telling Python to check every row's Category against Electronics. Each row produces a Boolean result, True or False. Here is the result

As you can see, we get a Boolean output: True means the row is in Electronics, and False means it is not. This is fine and all, but it gets confusing when you're dealing with a large number of records. Let's fix that.

# Filter by a single condition
# Example: All orders from the "Electronics" category.
df_sales[df_sales['Category'] == 'Electronics']

Here, I just wrapped the condition in the DataFrame. And with that, we get this output

So much better, right? Let's continue

Filtering rows by a numeric condition

Let's try to retrieve records where the order quantity is greater than 2. It's straightforward.

# Filter rows by numeric condition
# Example: Orders where Quantity > 2
df_sales['Quantity'] > 2

Here, I use the greater-than operator >. As with the output above, we get a Boolean result with True and False values. Let's fix it the same way as before.
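The wrapping trick from the previous section applies here too. A quick sketch on a small made-up frame (the values are invented for illustration):

```python
import pandas as pd

df_sales = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Phone"],
    "Quantity": [1, 3, 5],
})

# Wrap the boolean mask in the DataFrame to get the rows themselves
result = df_sales[df_sales["Quantity"] > 2]
print(result["Product"].tolist())  # ['Desk', 'Phone']
```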

And there we go!

Filtering by a date condition

Filtering by date is straightforward. For example.

# Filter by date condition
# Example: Orders placed after "2023-01-08"
df_sales[df_sales["OrderDate"] > "2023-01-08"]

This checks for orders placed after January 8, 2023. And here is the output.

The nice thing here is that ISO-formatted date strings like these compare correctly even as plain strings. In cases where you encounter an error, or your dates are in another format, you may want to convert the column before filtering using the to_datetime() function. Here is an example

df_sales["OrderDate"] = pd.to_datetime(df_sales["OrderDate"])

This converts our OrderDate column to a proper datetime data type. Let's kick things up a notch.

Filtering by Multiple Conditions (AND, OR, NOT)

Pandas lets us combine multiple conditions using logical operators. However, these operators are different from Python's built-in keywords (and, or, not). Here are the logical operators you will work with the most

& (logical AND)

The ampersand symbol & represents AND in pandas. We use it when both conditions must be true. For example, let's return orders in the "Furniture" category where Price > 500.

# Multiple conditions (AND)
# Example: Orders from "Furniture" where Price > 500
df_sales[(df_sales["Category"] == "Furniture") & (df_sales["Price"] > 500)]

Let's break this down. We have two conditions: one that matches orders in the Furniture category and one that matches prices > 500. Using &, we combine both conditions, and a row must satisfy both to be returned.

Here is the result.

One record was found. If you look at it, it meets our condition. Let's do the same for OR

| (logical OR)

The vertical bar symbol | represents OR in pandas. In this case, at least one of the conditions must be True. For example, let's retrieve records with orders from the "North" OR "East" region.

# Multiple conditions (OR)
# Example: Orders from "North" region OR "East" region.
df_sales[(df_sales["Region"] == "North") | (df_sales["Region"] == "East")]

Here is the output
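The section title also promises NOT: in pandas that's the tilde ~, which inverts a boolean mask. A minimal sketch on made-up data:

```python
import pandas as pd

df_sales = pd.DataFrame({
    "Region": ["North", "East", "South", "West"],
    "Price": [100, 200, 300, 400],
})

# ~ negates the mask: every row NOT in the North region
not_north = df_sales[~(df_sales["Region"] == "North")]
print(not_north["Region"].tolist())  # ['East', 'South', 'West']
```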

Filtering with isin()

Let's say I want to retrieve orders from multiple customers. I could always chain conditions with the | operator. For example

df_sales[(df_sales['Customer'] == 'Alice') | (df_sales['Customer'] == 'Charlie')]

Output:

There is nothing wrong with that, but there is a cleaner, easier way: the isin() function. Here's how it works

# Orders from customers ["Alice", "Diana", "James"].
df_sales[df_sales["Customer"].isin(["Alice", "Diana", "James"])]

Output:

The code is simple and clean. Using the isin() function, I can pass as many values in the list as I want. Let's move on to more advanced filtering.

Filtering using string matching

One of Pandas' powerful but underutilized features is string matching. It helps with many data cleaning tasks when you are searching for patterns in your DataFrame's records. It is similar to the LIKE operator in SQL. For example, let's retrieve customers whose names start with "A".

# Customers whose name starts with "A".
df_sales[df_sales["Customer"].str.startswith("A")]

Output:

Pandas gives you the .str accessor to use string functions on a column. Here is another example

# Products ending with "top" (e.g., Laptop).
df_sales[df_sales["Product"].str.endswith("top")]

Output:
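For the closest analogue of SQL's LIKE '%...%', there is also .str.contains(), which matches a substring anywhere in the value. A sketch on made-up product names:

```python
import pandas as pd

df_sales = pd.DataFrame({
    "Product": ["Laptop", "Desktop", "Phone", "Desk"],
})

# Rows where the product name contains "top" anywhere
# (pass case=False for a case-insensitive match)
mask = df_sales["Product"].str.contains("top")
print(df_sales[mask]["Product"].tolist())  # ['Laptop', 'Desktop']
```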

Filter using the query() method

If you are coming from a SQL background, this method can be very useful for you. Let's retrieve the orders in the Electronics category where Quantity >= 2. It could always go like this.

df_sales[(df_sales["Category"] == "Electronics") & (df_sales["Quantity"] >= 2)]

Output:


But if you are someone trying to bring your SQL skills over, this will work for you instead

df_sales.query("Category == 'Electronics' and Quantity >= 2")

You will get the same output as above. It reads a lot like SQL if you ask me, and you get to skip the & symbol and the extra brackets entirely. I will use this method often.
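One more query() trick worth knowing: you can reference Python variables inside the query string with @, which keeps the expression readable when the threshold lives in a variable. A sketch on made-up data:

```python
import pandas as pd

df_sales = pd.DataFrame({
    "Category": ["Electronics", "Furniture", "Electronics"],
    "Quantity": [1, 3, 4],
})

min_qty = 2
# @min_qty pulls the value from the surrounding Python scope
result = df_sales.query("Category == 'Electronics' and Quantity >= @min_qty")
print(result["Quantity"].tolist())  # [4]
```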

Filtering by a range of values

Pandas allows you to filter on a range of values. For example, orders where the Price is between 50 and 500 would go like this

# Orders where the Price is between 50 and 500
df_sales[df_sales["Price"].between(50, 500)]

Output:

It's straightforward.

Filtering out missing values (NaN)

This is probably the most useful technique because, as a data analyst, one of the data cleaning tasks you will do most often is filtering out missing values. Doing this in Pandas is straightforward using the notna() function. Let's keep only the rows where Price is not null.

# Filter rows where Price is not null.
df_sales[df_sales["Price"].notna()]

Output:

And there we go. The output looks identical here because this sample dataset happens to have no missing prices, but on real data this filter earns its keep.
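The flip side is isna(), which keeps only the rows where a value is missing, which is handy for inspecting the bad records before dropping them. A sketch with a deliberately missing price (the data is made up):

```python
import pandas as pd
import numpy as np

df_sales = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Phone"],
    "Price": [1200.0, np.nan, 800.0],
})

# Keep only the rows where Price IS missing
missing = df_sales[df_sales["Price"].isna()]
print(missing["Product"].tolist())  # ['Desk']
```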

Conclusion

The next time you open a dirty CSV and wonder "Where do I start?", try filtering first. It's a quick way to cut through the noise and find the story hidden in your data.

Switching to Python for data analysis once felt like a big step, coming from a SQL background. But Pandas turned out to be simpler than I expected, and filtering data stopped being a chore.

The good part is that these same techniques work regardless of the dataset: sales numbers, survey responses, web analytics, you name it.

I hope you found this article helpful.

I write these articles as a way to test and strengthen my understanding of technical concepts, and to share what I'm learning with others who may be on the same path. Feel free to share with others. Let's learn and grow together!

Feel free to say hello on any of these platforms

Medium

LinkedIn

Twitter

YouTube
