I Spent An Hour On Data Preprocessing Before Querying Gemini

0 5 5 minutes read

I Spent An Hour On Data Preprocessing Before Querying Gemini

important time for the preparation of the data for the activities below. Whether it involves data cleansing, handling missing values, feature engineering, data preprocessing, or post processing, this phase requires a lot of time.

So, I was working on this post-processing task where I needed to create a new column in a Pandas DataFrame by extracting values from an existing column, based on data from two other columns.

I could have directly asked the LLM to write the code (which I usually do) but this time I wanted to do it myself. It was early morning and I had a fresh mind so I was in a position to handle complex data tasks.

Here's what I had to do. I had a DataFrame too predicted_categories, pred_category_idagain text_predicted_probs columns.

Prices in predicted_categories column lists the five categories in the format “category_id” – “category_description”.

['80814001 - Freze Uçları', 
 '13003106 - Freze', 
 '80805004 - Sanayi Makineleri', 
 '13003144 - Torna Makinesi', 
 '13003195 - Kumpas']

I text_predicted_probs column has the predicted probabilities of these five categories respectively.

[0.943, 0.018, 0.008, 0.006, 0.004]

Therefore, the initial value of text_predicted_probs first class opportunities at predicted_categoriesand so on.

I pred_category_id The column shows the class id predicted for the other model. What I need is the predicted probabilities of the category in pred_category_id column.

I need to get this order pred_category_id of predicted_categories column and take its value from test_predicted_probs column.

The diagram below shows what I want to achieve:

In this diagram, I want to find the probability of category 13003106, which is second in the list and its corresponding probability value is 0.018.

If we asked for a Gemini, or another advanced model, we would probably get an answer in seconds. But, I wanted to do it alone first and ask Gemini.

Let's start by reading the dataset into a Pandas DataFrame.

import pandas as pd

results = pd.read_csv("prediction_results.csv")

Prices in predicted_categories column lists strings with category ids and category names:

results.loc[0, "predicted_categories"]
# output: "['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']"

The list is however stored as a string so we first convert it to a list object using i literal_eval built-in function ast Python module:

ast.literal_eval(results.loc[0, "predicted_categories"])
# output:
['80814001 - Freze Uçları',
 '13003106 - Freze',
 '80805004 - Sanayi Makineleri',
 '13003144 - Torna Makinesi',
 '13003195 - Kumpas']

To extract the section ids, we can split each string in this list at the character “-” and select the first part after the split. Since we have a list with five categories, we must do this in the list direction as follows:

[category.split("-")[0].strip() 
for category in ast.literal_eval(results.loc[0, "predicted_categories"])]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']

We have created one value (ie one row) for you. To do the same job on all of them predicted_categories column, we can use list comprehensions. It will be a list comprehension inside another list comprehension (ie a nested list comprehension):

results.loc[:, "predicted_category_ids"] = [
    [category.split("-")[0].strip() for category in ast.literal_eval(predicted_categories)] 
    for predicted_categories in results["predicted_categories"]
]

Now we have the category ids extracted from the predicted categories column:

The next step is to check the order of the categories in the list of predicted category ids. We will then use this order to extract the predicted class probabilities.

The Python list object has an index method, which returns the index (ie the order) of the element in the list.

results.loc[0, "predicted_category_ids"]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']


results.loc[0, "predicted_category_ids"].index("13003106")
# output:
2

Once I get the id index of the predicted category, I can use it to find the probability of this category id text_predicted_probs column:

What we need to do:

Get a clue of pred_category_id of predicted_category_ids
Use this indicator to extract the correct value text_predicted_probs

These steps can be done in one operation by assigning these three columns. Let's check it on the first line:

for i, j, k in zip(results["pred_category_id"][:1], results["predicted_category_ids"][:1], results["text_predicted_probs"][:1]):
    print(j.index(str(i))) # get the index of pred_category_id in predicted_category_ids
    print(ast.literal_eval(k)[j.index(str(i))]) # get the value at this index in text_predicted_probs

# output:
0
0.943

We can confirm the output in the above screenshot. I pred_category_id is 80814001, which is the first (ie index = 0) in predicted_category_ids and the initial probability value is 0.943.

Lists in text_predicted_probs column are stored as strings so we use i literal_eval function to convert itself to a list item.

We can perform this operation on every DataFrame using a list comprehension as follows:

results.loc[:, "pred_category_prob"] = [
    float(ast.literal_eval(k)[j.index(str(i))]) if str(i) in j else 0
    for i, j, k in zip(results["pred_category_id"], results["predicted_category_ids"], results["text_predicted_probs"])
]

Basically we turned the loop in the previous code block into a list comprehension. I added a check only “if str(i) in j else 0” because there are lines with pred_category_id that is not in predicted_category_ids. In such cases, we set the probability value to 0.

The job is done as we can see below:

How did Gemini do this?

After working for about an hour (including breaks) I finished the job. Then, I wanted to ask Gemini to see how it would handle the same task.

I wrote a brief explaining my goal and included a small sample of data.

Here's what Gemini produced on its first attempt:

import pandas as pd

# 1. Create the sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008, 0.006, 0.004]
    ]
}

df = pd.DataFrame(data)

# 2. Define a function to extract the probability matching the category ID
def get_matching_prob(row):
    # Convert ID to string for matching
    target_id_str = str(row['pred_category_id'])
    
    # Iterate through the categories to find the matching index
    for index, category in enumerate(row['predicted_categories']):
        if category.startswith(target_id_str):
            # Return the corresponding probability from the same index
            return row['text_predicted_probs'][index]
            
    # Return None (or 0) if no match is found to prevent the code from crashing
    return None

# 3. Create the new column
df['pred_category_prob'] = df.apply(get_matching_prob, axis=1)

# Display the result
print(df[['pred_category_id', 'pred_category_prob']])

It first creates a sample DataFrame using the data I provided in the notification. Then, write a function that does exactly what I described.

It took me about an hour to write the code myself. The Gemini completed the same task in seconds based on a notification that took me just a few minutes to write.

However, there is a small problem with the Gemini code. It uses the apply function, which is not vectorized. In the case of working with large datasets, using a function (basically a loop) can increase the running time.

But the solution is simple. I asked Gemini to write a vectorized version of it and here's what I got:

import pandas as pd

# Sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri'],
        ['13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001, 13003195],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008],
        [0.6, 0.4]
    ]
}
df = pd.DataFrame(data)

# 1. Explode both list columns simultaneously to maintain alignment between category and probability
df_exploded = df.explode(['predicted_categories', 'text_predicted_probs'])

# 2. Extract the numeric ID from the category string using vectorized regex
df_exploded['extracted_id'] = df_exploded['predicted_categories'].str.extract(r'^(d+)').astype(float)

# 3. Filter for rows where the extracted ID matches the target 'pred_category_id'
matched = df_exploded[df_exploded['extracted_id'] == df_exploded['pred_category_id']]

# 4. Dedup the index (safety net in case an ID appears twice within the same list)
matched = matched[~matched.index.duplicated(keep='first')]

# 5. Map the extracted probability column back to the original DataFrame using the index
df['pred_category_prob'] = matched['text_predicted_probs']

df

The second solution worked fine and looked simpler than the code I wrote.

So, I spent about an hour on an assignment that an LLM could complete in less than five minutes. However, if I didn't know how Pandas works, I would have accepted the first solution, which was not enough. It's a great example of how LLMs can increase productivity, but only if you know what you're doing.

Source link

nimda 3 weeks ago

0 5 5 minutes read