Pandas Ain't Going Anywhere: Why It's Still My Go-To for Data Wrangling

When I was learning data science in 2020, Pandas was one of the most popular tools, and it still is. Although newer tools focus on improving Pandas's weaknesses in handling very large datasets, I still use Pandas for most data cleaning, processing, and analysis tasks. Yes, Pandas gives me a hard time when I'm working with billions of rows, but it's definitely more than enough for anything smaller than that.

I see Pandas being used not only in EDA or notebooks but also in production applications.

In this article, I will go through some data cleaning and processing tasks to show how powerful Pandas is.

Let's start with a dataset that contains stock keeping units (SKUs) and API search responses for those SKUs.

import pandas as pd

search_results = pd.read_csv("search_results.csv")

search_results.head()

Each search result looks like a list of dictionaries:

search_results.loc[0, "search_result"]

"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}}, 
{'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}}, 
{'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}}, 
{'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}}, 
{'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}}, 
{'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}}, 
{'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}}, 
{'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}] 
... and 5 entities remaining"

As we can see from the output, it is not a valid list of dictionaries because of the trailing part ("… and 5 entities remaining"). On top of that, it is stored as a single string.
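To see why the raw value can't be parsed directly, here is a minimal sketch (the toy string below only mirrors the format; it is not from the dataset) showing that ast.literal_eval rejects the truncated tail:

```python
import ast

# Toy string mimicking the truncated API response format (illustrative only)
raw = "[{'my_id': 'X', 'distance': 1.0, 'entity': {}}] ... and 5 entities remaining"

try:
    ast.literal_eval(raw)
except (ValueError, SyntaxError) as err:
    # The trailing "... and 5 entities remaining" is not valid Python syntax
    print(f"cannot parse as-is: {type(err).__name__}")
```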

To make it usable, we need to convert it to a proper list of dictionaries. The next line of code removes the trailing part by splitting the string at "…" and taking the first piece:

search_results.loc[0, "search_result"].split("...")[0].strip()

However, the output is still a single string. We can use Python's built-in ast module to convert it to a list:

import ast

res = ast.literal_eval(search_results.loc[0, "search_result"].split("...")[0].strip())

res

[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
 {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
 {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
 {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
 {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
 {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
 {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]

We now have the search result as a proper list of dictionaries, and it took only one line. Next, we need to apply the same operation to every row (i.e., every SKU).

One option is to loop over all the rows and repeat the same operation. However, this is not the best approach. We should prefer vectorized operations whenever we can. A vectorized operation executes on all rows at once instead of one row at a time.
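As a minimal sketch of the difference (the toy Series below is illustrative, not the real dataset), the loop calls Python code once per row, while the vectorized version runs a single string operation over the whole Series:

```python
import pandas as pd

# Toy data mimicking the truncated search-result strings (illustrative only)
s = pd.Series([
    "[{'my_id': 'A'}] ... and 5 entities remaining",
    "[{'my_id': 'B'}] ... and 5 entities remaining",
])

# Loop version: one Python-level call per row
looped = [x.split("...")[0].strip() for x in s]

# Vectorized version: one string operation over the whole Series
vectorized = s.str.replace(r"\.\.\..*", "", regex=True).str.strip()

print(list(vectorized) == looped)  # → True; both yield the cleaned strings
```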

For a single row, I used split to remove the trailing part of the string, but that approach does not carry over cleanly to a vectorized operation. The most robust option is a regex replacement:

search_results.loc[:, "search_result"] = (
    search_results["search_result"].str.replace(r"\.\.\..*", "", regex=True).str.strip()
)

This pattern matches the literal "…" (the dots are escaped, since an unescaped dot matches any character) and everything after it, and replaces the match with an empty string. In other words, it removes the "… and 5 entities remaining" part.

All the rows in the search_result column are now cleaned up:

search_results.loc[10, "search_result"]

"[{'my_id': 'HBCV00007F5Y2B', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00007UPQBM', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00008I29IH', 'distance': 1.0, 'entity': {}},
 {'my_id': 'HBCV00006U3ZYB', 'distance': 0.8961254358291626, 'entity': {}},
 {'my_id': 'HBCV0000AFA4H6', 'distance': 0.8702399730682373, 'entity': {}},
 {'my_id': 'HBCV00009CDGD4', 'distance': 0.86175537109375, 'entity': {}},
 {'my_id': 'HBCV000046336T', 'distance': 0.8594968318939209, 'entity': {}},
 {'my_id': 'HBCV00009QDZRT', 'distance': 0.8572311997413635, 'entity': {}},
 {'my_id': 'HBCV00008E11P3', 'distance': 0.8553324937820435, 'entity': {}},
 {'my_id': 'HBV00000C4IY6', 'distance': 0.8539167642593384, 'entity': {}}]"

They are still stored as strings, but I can easily convert them to lists using the ast module, which I will do in the next step.

What I'm interested in are the SKUs returned in the search results. I will create a new column by extracting the SKUs from the dictionary. I can access them using the “my_id” dictionary key.

There are three parts to this job:

  • Convert the search-result string to a list using the literal_eval function
  • Extract the SKU from the my_id key of each dictionary
  • Do this for every dictionary in the list with a list comprehension

We can do all of these operations by applying a lambda function to every row:

search_results.loc[:, "result_skus"] = search_results["search_result"].apply(
    lambda x: [item["my_id"] for item in ast.literal_eval(x)]
)

search_results.head()

Each row in the result_skus column contains a list of 10 SKUs. Let's say I need these 10 SKUs in separate rows: for each row in the sku column, 10 rows should be created, one per SKU in the result_skus list. There is a very simple way to do this in Pandas: the explode function.

data = search_results[["sku", "result_skus"]].explode("result_skus", ignore_index=True)

data.head()

We created a new data frame with the sku and result_skus columns; the explode function turned each element of the list into its own row.

Now consider the opposite. We have a data frame as shown above, but we want all the result SKUs for a given sku in one row.

We can use the groupby function to group rows by sku and then apply the list function to the result_skus column:

new_data = data.groupby("sku", as_index=False)["result_skus"].apply(list)

new_data.head()

This takes us back to the previous format.

Using the explode function, we created a data frame with a separate row for each SKU in the result_skus column. What if we need them split into separate columns instead of rows?

One option is to apply the pd.Series constructor to the result_skus column and concatenate the resulting columns to the original data frame:

new_cols = new_data["result_skus"].apply(pd.Series)

new_data = pd.concat([new_data, new_cols], axis=1)

new_data.head()

Columns 0 to 9 contain the 10 SKUs from the result_skus column. However, this apply-based approach is not vectorized.

We have another option, which is vectorized and very fast.

new_cols = pd.DataFrame(new_data["result_skus"].tolist())

new_data = pd.concat([new_data, new_cols], axis=1)

This code will give us the same data frame as above but much faster.
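Here is a rough benchmark sketch (the toy frame and row count are mine; absolute timings will vary by machine) confirming that both approaches produce the same frame while the tolist-based one typically finishes much sooner:

```python
import time

import pandas as pd

# Toy frame: each row holds a list of 10 fake SKUs (illustrative data)
df = pd.DataFrame(
    {"result_skus": [[f"SKU{i}_{j}" for j in range(10)] for i in range(5_000)]}
)

t0 = time.perf_counter()
slow = df["result_skus"].apply(pd.Series)  # builds one Series per row
t_apply = time.perf_counter() - t0

t0 = time.perf_counter()
fast = pd.DataFrame(df["result_skus"].tolist())  # builds the frame in one go
t_tolist = time.perf_counter() - t0

# Both produce the same 5000 x 10 frame with columns 0..9
print(slow.equals(fast), f"apply: {t_apply:.3f}s  tolist: {t_tolist:.3f}s")
```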

I have shown a typical data cleaning and processing task that a data scientist or analyst might encounter at work. I've been in the field for over five years, and Pandas has always been enough for what I need, except when I'm working with very large datasets (e.g., billions of rows).

Tools better suited for such large datasets have syntax similar to Pandas. For example, PySpark is something of a hybrid of Pandas and SQL, and Polars is very similar to Pandas in terms of syntax. Therefore, learning Pandas is still a very important skill for anyone working in data science and AI.

Thanks for reading.
