Machine Learning

The Rule Everyone Misses: How to Stop Confusion with location and yiloc in Pandas

with pandas, you've probably stumbled upon this classic confusion: you have to use loc or iloc extract data? At first glance, they look almost identical. Both are used to cut, sort, and extract rows or columns from a DataFrame – yet one small difference in how they work can completely change your results (or throw an error that leaves you scratching your head).

I remember the first time I tried to select a row with df.loc[0] and wondered why it didn't work. The reason? Pandas don't always “think” in terms of hierarchy — sometimes they use labels. That's it loc vs iloc the difference comes in.

In this article, I'll walk through a simple little project using a small dataset of student performance. In the end, you just won't understand the difference between them loc again ilocbut also know exactly when to use each in your data analysis.

We present the dataset

The dataset is from ChatGPT. It contains some basic records of student evaluations. Here is a summary of our dataset

import pandas as pd
df = pd.read_csv(‘student_scores.csv’)
df

Output:

I will try to do some data extraction operations using loc and yloc, like

  • Selects a single row from a DataFrame
  • It is worth one price
  • It deletes many lines
  • To cut the width of the lines
  • To exclude certain columns and
    Logical filtering

First, let me briefly explain what loc and loc are in Pandas.

What is loc and loc

Loc again iloc data extraction techniques in Pandas. They are very useful for selecting data from records.

Loc uses labels to retrieve records in a DataFrame, so I find it easy to use. Iloc, however, is useful for more accurate retrieval of records, because iloc selects data based on the absolute positions of rows and columns, similar to how you would index a Python list or array.

But if you're like me, you might be wondering. If the area is clearly simplified due to line labels, why bother using iloc? Why bother trying to find row indexes, especially when dealing with large data sets? Here are a few reasons.

  • Many times, data sets do not come with clean row indexes (like 101, 102, …). Instead, you have a clear index (0, 1, 2, …), or you may misspell row labels when downloading records. In this case, it is better to use iloc. Later in this article, it's something we'll be talking about again.
  • In some cases, such as machine learning preprocessing, labels are not really important. You only care about the data summary. For example, the first or last three records. iloc is really useful in this situation. iloc make the code shorter and shorter, especially if the labels change, which can break your machine learning model
  • Many datasets have duplicate row labels. In this case, iloc it always works as the positions are different.
  • The bottom line is, use loc when your dataset has clear, meaningful labels and you want your code to be readable.
  • Use iloc when you need a location-based control, or when labels are missing/dirty.

Now that I've cleared my breath, here's the basic syntax for loc and yloc below:

df.loc[rows, columns]
df.iloc[rows, columns]

The syntax is almost identical. For this exercise, let's try to retrieve some records using loc and yloc.

Selects a single row from a DataFrame

To make a proper display, let's first change the column index and make it student_id. Currently, pandas is auto-indexing:

# setting student_id as index
df.set_index('student_id', inplace=True)

Here is the output:

It looks better. Now, let's try to retrieve all of Bob's records. Here's how to deal with that using loc:

df.loc[102]

What I'm doing here is specifying a line label. This should return all of Bob's records.

Here is the output:

name   Bob
math    58
english 64
science 70
Name: 102, dtype: object

The nice thing about this is that I can drill down, like a hierarchy. For example, let's try to retrieve some information about Bob, such as his math score.

df.loc[102, ‘math’]

The output can be 58.

Now let's try this using iloc. If you are familiar with lists and arrays, indexing always starts at 0. So if I want to retrieve the first record in the DataFrame, I will have to specify the index 0. In this case, I'm trying to retrieve Bob, which is the second row in our DataFrame – so, in this case, the index will be 1.

df.iloc[1]

We will get the same output as above:

name   Bob
math    58
english 64
science 70
Name: 102, dtype: object

And if I try to drill down and retrieve Bob's math score. Our index will also be 1, since the math is on the second row

df.iloc[1, 1]

The output can be 58.

Okay, I can wrap this topic up here, but loc and yloc offer some very impressive features. Let's run through some of them.

Remove Multiple Lines (Special Readers)

Pandas allows you to retrieve multiple rows using loc and yloc. I will make a demonstration by obtaining the records of many students. This time, instead of storing a single value in our loc/iloc method, we'll be storing an array. Here's how to do that with loc:

# Alice, Charlie and Edward's records
df.loc[[101, 103, 105]]

Here is the output:

And here's how to do that iloc:

df.iloc[[0, 2, 4]]

We will get the same output:

I hope you get your chance.

Cut Range of Lines

Another useful feature offered by Python Pandas is the ability to cut a range of lines. Here, you can specify your starting and ending point. Here is the syntax for loc/iloc slicing:

df.loc[start_label:end_label]

In lochowever, an end label will be included in the output – which is very different from Python's default truncation.

The syntax is the same as for ilocexcept that the end label will be removed from the output (like Python's default truncation).

Let's go through an example:

I am trying to retrieve a list of student records. Let's try that out using loc:

df.loc[101:103]

Output:

As you can see above, the last label is included in the result. Now, let's try that in action iloc. If you remember, the index of the first line will be 0, which means the third line will be 2.

df.iloc[0:3]

Output:

Here, the third line is not included. But if you're like me (a questioner), you might be wondering, why would you want the last line removed? In what situations would it be useful? What if I told you that it actually makes your life easier? Let's get that out of the way quickly.

Assuming you want to process your DataFrame in chunks of 100 rows each.

If the cut was combined, you would have to do some weird math to avoid repeating the last row.

But because the cut is special at the end, you can do this easily, as well.

df.iloc[0:100] # first 100 rows
df.iloc[100:200] # next 100 rows
df.iloc[200:300] # next 100 rows

Here, there will be no overlap, and there will be fixed chunk sizes. Another reason is how similar the scope works in Pandas. Usually, if you want to retrieve the range of rows, it also starts at 0 and does not include the last row. Having this common sense in loc cutting is really helpful, especially when you're working on some web scraping or breaking a range of lines.

Remove Vertical Columns (Headings)

I would also like to introduce you to the colony : a symbol. This allows you to retrieve all the records in your DataFrame using a field. Look like- in SQL. The nice thing about this is that you can filter and extract a subset of columns.

This is usually where I find myself starting. I use it to get an overview of a particular dataset. From there, I can start sorting and drilling down. Let me show you what I mean.

Let's return all the records:

df.loc[:]

Output:

From here, I can extract some such columns. By location:

df.loc[:, [‘math’, ‘science’]]

Output:

With iloc:

df.iloc[:, [2, 4]]

The output will be the same.

I like this feature because it is flexible. Let's say I want to retrieve Alice and Bob's math and science scores. It will go something like this. I can just specify the range of records and columns I want.

With loc:

df.loc[101:103, ['name', 'math', 'science']]

Output:

With iloc:

df.iloc[0:3, [0, 1, 3]]

We will get the same output.

Boolean Sorting (Who got over 80 in Math?)

The last feature I want to share with you is Boolean filtering. This allows for a flexible background. Let's say I want to retrieve the records of students who scored more than 80 points in Math. Normally, in SQL, you would have to use WHERE and HAVING clauses. Python makes this easy.

# Students with Math > 80.
df.loc[df['math'] > 80]

Output:

You can also filter on multiple conditions using the AND(&), OR(|), and NOT(~) operators. For example:

# Math > 70 and Science > 80
df.loc[(df[‘math’] > 70) & (df[‘science’] > 80)]

Output:
PS I wrote an article about sorting with Pandas. You can read it here

Usually, you will find yourself using this feature with loc. It can be a bit complicated ilocas it does not support Boolean conditions. To do this with iloc, you'll need to convert the Boolean filter to an array, like this:

# Students with Math > 80.
df.iloc[list(df['math'] > 80)]

To avoid headaches, just go with it loc.

The conclusion

You will probably use the loc again iloc multiple methods when working on a dataset. So it is important to know how they work and differentiate between the two. I like how easy and flexible it is to release records in these ways. Whenever you are confused, just remember that loc is about labels while loc is about positions.

I hope you found this article helpful. Try running these examples on your own dataset to see the difference in action.

I write these articles as a way to test and strengthen my understanding of technical concepts – and to share what I'm learning with others who may be on the same path. Feel free to share with others. Let's learn and grow together. Hello!

Feel free to say hello on any of these forums

In the middle

LinkedIn

Twitter

YouTube

Source link

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button