
All About Pyjanitor's Method Chaining, and Why It's Useful



# Introduction

Working deeply with data in Python teaches us all an important lesson: cleaning data often feels less like doing data science and more like working as a digital curator. A typical workflow goes like this: load a dataset, discover a lot of dirty column names, run into missing values, and juggle a pile of temporary variables, only the last of which contains your final, clean dataset.

Pyjanitor provides a cleaner way to perform these steps. The library leans on the concept of method chaining to turn tedious data cleaning processes into pipelines that look good, work well, and read naturally.

This article shows how method chaining works and how to use it with Pyjanitor to clean data.

# Understanding Method Chaining

Method chaining is nothing new in programming: in fact, it is a well-established coding pattern. It consists of calling multiple methods in sequence on an object, all in a single statement. This way, you don't need to reassign the variable after each step, because each method returns an object on which the next method is called, and so on.

The following example helps to understand the concept at its core. See how we can apply a few simple transformations to a small piece of text (a string) using “regular” Python:

text = "  Hello World!  "
text = text.strip()
text = text.lower()
text = text.replace("world", "python")

The resulting value in the script will be: "hello python!".

Now, with method chaining, the same process looks like this:

text = "  Hello World!  "
cleaned_text = text.strip().lower().replace("world", "python")

Note that the logical flow of the applied operations reads from left to right: all in one, unified chain of thought!

If that clicked, you now have a good understanding of method chaining. Let's translate this idea into the context of data science with Pandas. A typical multi-step data cleaning job on a DataFrame usually looks like this without chaining:

# Traditional, step-by-step Pandas approach
import pandas as pd

df = pd.read_csv("data.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.dropna(subset=['id'])
df = df.drop_duplicates()

As we will see shortly, with method chaining we can build a single pipeline in which DataFrame operations are linked together inside parentheses. In addition, we no longer need intermediate variables holding half-cleaned DataFrames, which makes for cleaner, less bug-prone code. And on top of that, Pyjanitor makes this process seamless.
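To make this concrete, here is a sketch of the same pipeline rewritten as a plain-Pandas chain. The column names and the small in-memory frame (standing in for data.csv, so the snippet is self-contained) are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for data.csv
raw = pd.DataFrame({
    "Customer ID": [1, 2, 2, None],
    "Full Name": ["Ann", "Ben", "Ben", "Cara"],
})

clean = (
    raw
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))  # snake_case headers
    .dropna(subset=["customer_id"])   # drop rows with a missing id
    .drop_duplicates()                # remove exact duplicate rows
)
print(clean)
```

The whole transformation reads top to bottom as one expression, and `raw` itself is never mutated, so the pipeline is easy to re-run or reorder.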

# Installing Pyjanitor: An example application

Pandas itself provides native support for method chaining to some extent. However, some of its key functions were not strictly designed with this pattern in mind. This is the main motivation behind Pyjanitor, which is based on an almost identically named R package: janitor.

Basically, Pyjanitor can be framed as a Pandas extension that delivers a package of custom data cleaning routines in a chaining-friendly way. Examples of its application programming interface (API) method names include clean_names(), rename_column(), remove_empty(), and so on. This collection of intuitive method names makes code read almost like plain English. In addition, Pyjanitor is entirely open source and free, and runs seamlessly in cloud environments such as Google Colab.

To fully understand how method chaining is used in Pyjanitor, let's walk through an example. We first create a small, intentionally messy synthetic dataset and load it into a Pandas DataFrame.

IMPORTANT: to avoid common but nasty errors caused by incompatibilities between library versions, make sure you have the latest available versions of both Pandas and Pyjanitor by running !pip install --upgrade pyjanitor pandas first.

import numpy as np
import pandas as pd
import janitor  # registers Pyjanitor methods on the DataFrame

messy_data = {
    'First Name ': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    '  Last_Name': ['Smith', 'Jones', 'Brown', 'Smith', 'Doe'],
    'Age': [25, np.nan, 30, 25, 40],
    'Date_Of_Birth': ['1998-01-01', '1995-05-05', '1993-08-08', '1998-01-01', '1983-12-12'],
    'Salary ($)': [50000, 60000, 70000, 50000, 80000],
    'Empty_Col': [np.nan, np.nan, np.nan, np.nan, np.nan]
}

df = pd.DataFrame(messy_data)
print("--- Messy Original Data ---")
print(df.head(), "\n")

Now we chain a series of Pyjanitor methods that process both the column names and the data itself:

cleaned_df = (
    df
    .rename_column('Salary ($)', 'Salary')  # 1. Manually fix tricky names BEFORE getting them mangled
    .clean_names()                          # 2. Standardize everything (makes it 'salary')
    .remove_empty()                         # 3. Drop empty columns/rows
    .drop_duplicates()                      # 4. Remove duplicate rows
    .fill_empty(                            # 5. Impute missing values
        column_names=['age'],               # CAUTION: after previous steps, assume lowercase name: 'age'
        value=df['Age'].median()            # Pull the median from the original raw df
    )
    .assign(                                # 6. Create a new column using assign
        salary_k=lambda d: d['salary'] / 1000
    )
)

print("--- Cleaned Pyjanitor Data ---")
print(cleaned_df)

The code above is self-explanatory, with inline comments explaining each method that is called at every step of the chain.

This is the output of our example, comparing the original dirty data with the cleaned version:

--- Messy Original Data ---
  First Name    Last_Name   Age Date_Of_Birth  Salary ($)  Empty_Col
0       Alice       Smith  25.0    1998-01-01       50000        NaN
1         Bob       Jones   NaN    1995-05-05       60000        NaN
2     Charlie       Brown  30.0    1993-08-08       70000        NaN
3       Alice       Smith  25.0    1998-01-01       50000        NaN
4         NaN         Doe  40.0    1983-12-12       80000        NaN 

--- Cleaned Pyjanitor Data ---
  first_name_ _last_name   age date_of_birth  salary  salary_k
0       Alice      Smith  25.0    1998-01-01   50000      50.0
1         Bob      Jones  27.5    1995-05-05   60000      60.0
2     Charlie      Brown  30.0    1993-08-08   70000      70.0
4         NaN        Doe  40.0    1983-12-12   80000      80.0

# Wrapping up

Throughout this article, we've learned how to use the Pyjanitor library to automate and simplify complex data cleaning processes. This makes the code clean, clear, and, so to speak, self-documenting, so that other developers, or your future self, can read the pipeline and easily understand what happens on the journey from raw dataset to ready-to-use data.

Good job!

Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning and LLMs. He trains and guides others in using AI in the real world.
