I am going to help you grasp loc and iloc in Pandas. I set out to describe these two concepts because I and others have struggled with them. When I had my Aha! moment, I knew it was time to write about it. So let's get you over the hump of learning these two constructs!
I am assuming that since you are reading this, you are a beginner (or close to it) in Python/Pandas. You probably have been struggling with learning about loc and iloc. Based on this assumption, I am going to deviate from my usual format - that is, I am not going to include the code on GitHub or the data for that matter.
The examples I show are short enough for you to type. And typing in the examples yourself will help your brain with the learning process. If I give you the code, you may be tempted (like I often am) to simply try to learn by scanning the code with your eyes.
Interactive learning, especially in coding, is one of the fastest ways to reinforce concepts. For this reason, I am going to ask you to enter the code examples manually. You'll thank me for it later!
Scenario: Customer Sales
You are a data analyst helping out an e-commerce business owner. The owner just started out but wants a tool that can help determine sales by customer. To start, you want to help your client find specific customers. Eventually, you can aggregate sales by either region or income strata.
But first, you need to help your client find those customers. And to do that, we'll need to load sales data.
Here is the segment of code that can be used to create the DataFrame for this tutorial:
import pandas as pd

# Raw customer sales data, one list per column
data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223]
}
# Row labels deliberately start at 101 rather than 0
row_labels = [101, 102, 103, 104, 105, 106]
After you load the data, you'll need to put it into a DataFrame:
sales = pd.DataFrame(data, index=row_labels)
You may be wondering why I chose the row_labels numbers shown above (sequenced from 101-106). It's because most people use sequences starting with 0, which makes the labels seem like they are the same as the index positions of the rows.
Remember though, when you set an index like we did with row_labels, that is not the position of the index. It is just a label. This is crucial to understand!
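To see labels and positions side by side, here is a minimal sketch that rebuilds the sales DataFrame from above. The variable names labels, positions, and position_of_101 are my own additions for illustration.

```python
import pandas as pd

data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223],
}
sales = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106])

labels = sales.index.tolist()                 # the labels we assigned: 101 through 106
positions = list(range(len(sales)))           # the positions pandas counts: 0 through 5
position_of_101 = sales.index.get_loc(101)    # where label 101 currently sits

print(labels)
print(positions)
print(position_of_101)
```

The labels and the positions are two separate lists of numbers. They only look alike when the labels happen to start at 0 and nothing has been reordered.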
Many tutorials will start with sequential numbers starting with 0. This is the cause of the confusion. It makes newcomers believe that the labels are the same as the position of the index and that is not the case.
Below, I will create 0-based labels and you'll see for yourself how that is not the same as the positional numbers!
How to Access Data
You can probably guess by the title of this tutorial that we'll be using loc and iloc to access our data. The difference between the two is that loc[] accesses data by label and iloc[] accesses data by its positional index. Know that labels can be numeric, and in our case, that is what they are. Labels can also be text-based (more below) or dates.
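Since labels do not have to be numbers, here is a small sketch with text and date labels. The two mini DataFrames (by_name and by_date) and their values are hypothetical and are not part of the sales data used in the rest of this tutorial.

```python
import pandas as pd

# Hypothetical mini-examples, made up only to show other label types.
by_name = pd.DataFrame({'Sales': [250, 5000]}, index=['John', 'Ann'])
by_date = pd.DataFrame(
    {'Sales': [250, 5000]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02']),
)

ann_sales = by_name.loc['Ann', 'Sales']                               # text label
second_day_sales = by_date.loc[pd.Timestamp('2024-01-02'), 'Sales']   # date label
second_row_sales = by_date.iloc[1]['Sales']                           # position, whatever the label type

print(ann_sales, second_day_sales, second_row_sales)
```

Notice that iloc does not care what kind of labels the DataFrame has. It only counts rows.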
To access rows of a DataFrame by label, use .loc[]. To access them by index position, use .iloc[]. To get the row corresponding to the index label 101:
sales.loc[101]
NOTE: Wherever the row containing the label 101 ends up, this operation will always locate that row. For instance, if you sort the DataFrame and the row with 101 as the index label ends up in, say, the third index position, this operation will still retrieve that row (which is now the third row).
To get the first row (positional) no matter how you alter or sort the DataFrame:
sales.iloc[0]
As the sales DataFrame is currently configured (no sorting or alterations), these two operations return the same record: the first one.
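Here is a minimal sketch of the two points above: the two lookups match while the DataFrame is unsorted, and loc[101] keeps following its row after a sort. Sorting by Sales ascending is just my choice for shuffling the rows, and by_sales is a name I made up for the sorted copy.

```python
import pandas as pd

data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223],
}
sales = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106])

# Unsorted: label 101 and position 0 are the same row.
same_record = sales.loc[101].equals(sales.iloc[0])

# Assumption: any reordering would do; sorting by Sales is only an example.
by_sales = sales.sort_values(by='Sales')

name_at_label_101 = by_sales.loc[101, 'Name']    # follows the label
name_in_first_row = by_sales.iloc[0]['Name']     # follows the position
position_of_101 = by_sales.index.get_loc(101)    # where label 101 sits now

print(same_record)
print(name_at_label_101, name_in_first_row, position_of_101)
```

With this particular sort, John (label 101) moves to position 2, which is the third row, while the first row now belongs to Joe.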
Challenge: For the current configuration of this DataFrame, what happens when you try to access the following:
sales.loc[0]
{Scroll down for answer}
As you can see, you get an error (a KeyError). Hopefully, you can see why. There is no index with the label 0, and .loc[] accesses labels. The only labels available in this current configuration are 101-106.
.iloc[0] will work because it gets the first record of the dataframe no matter what is there and no matter what label was defined.
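If you would like to see the failure without your session stopping on the error, one option is to wrap the lookup in try/except. This is only a sketch, and label_found is a flag I added for illustration.

```python
import pandas as pd

data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223],
}
sales = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106])

try:
    sales.loc[0]            # no row carries the label 0
    label_found = True
except KeyError:
    label_found = False     # pandas raises KeyError for a missing label

first_row = sales.iloc[0]   # position 0 exists in any non-empty DataFrame

print(label_found)
print(first_row['Name'])
```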
If this isn't clear, don't worry. We're not done yet. The next part of the tutorial should bring it home.
What we are going to do now is to create the labels as 0,1,2,3,4,5. And we'll access the label 0 and the position index 0. Without doing anything else to the dataframe, both of these instructions should point to the same first record.
First, we'll create a new DataFrame (from the same original data) called sales1 with 0-based index labels:
row_labels = [0,1,2,3,4,5]
sales1 = pd.DataFrame(data, index = row_labels)
Two items to observe here. First, we can now access .loc[0]. Why? Because it is part of the index labels that we defined for sales1, i.e., 0-5.
Second, as specified, .loc[0] and .iloc[0] point to the same record. Check!
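To check both observations yourself, a sketch along these lines should do. The names by_label and by_position are mine.

```python
import pandas as pd

data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223],
}
sales1 = pd.DataFrame(data, index=[0, 1, 2, 3, 4, 5])

by_label = sales1.loc[0]       # the row whose label is 0
by_position = sales1.iloc[0]   # the row in the first position

print(by_label.equals(by_position))
print(by_label['CustomerID'])
```

The two lookups agree here only because the label 0 happens to sit in position 0.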
Next, let's sort the data by CustomerID descending. This will essentially reverse the label index. But see if you can guess what it will do to the positional index. Don't worry. That's what this exercise is all about - to help you see what will happen.
Let's create a new DataFrame (sales2) that is a copy of sales1 but sorted by CustomerID descending:
sales2 = sales1.sort_values(by = "CustomerID", ascending = False).copy()
You'll notice the code fragment shows that I have not run the .loc[0] or the .iloc[0]. I want you to try to guess, based on the data, which row each of these will display. Remember, .loc[] is for index labels and .iloc[] is for positional index. Hint: I stated earlier that .iloc[0] will ALWAYS point to the first row no matter what you do to the DataFrame (like sort it). However, is the same true for .loc[0]?
Here is the result of both instructions:
The two instructions point to markedly different records. The .loc[0] points to the last record in the DataFrame (because of the sort) and the .iloc[0] points to where? You guessed it - the first row, as it always will. So whatever row ends up first due to the operations you perform on the DataFrame, .iloc[0] will access that first record for you. But the position of a given label will depend on what type of operation you perform on the DataFrame.
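In case you want to compare your own output, here is a self-contained sketch of this step. It rebuilds sales1 and sales2 as described above; label_order is a helper name I added.

```python
import pandas as pd

data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223],
}
sales1 = pd.DataFrame(data, index=[0, 1, 2, 3, 4, 5])
sales2 = sales1.sort_values(by='CustomerID', ascending=False)

label_order = sales2.index.tolist()            # labels travel with their rows
customer_at_label_0 = sales2.loc[0, 'CustomerID']
customer_in_first_row = sales2.iloc[0]['CustomerID']

print(label_order)
print(customer_at_label_0, customer_in_first_row)
```

The label order comes out reversed (5 down to 0), so label 0 still belongs to X1000 but now sits in the last row, while the first row holds X1050.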
Related: 10 minutes to Pandas
For sales1 and sales2, what specific row did .loc[0] return? It returned the same row that contained the label 0. But for sales, the label 0 did not exist. Instead, the label for that same record was 101.
Let's try this exercise again and this time we'll copy the sales2 DataFrame but sort it by Sales descending. But before we do, let's go over the guidelines set out in this tutorial thus far:
.loc[0] will return the row that carries the row label 0. In sales1 and sales2, this corresponded to CustomerID X1000 with the name John, who is age 30 in the North region. John is also categorized as a High income individual, but he is only responsible for $250 in sales. It did not matter that sales1 was sorted by the label index and sales2 was sorted by CustomerID descending. .loc[0] always found CustomerID X1000 (which corresponds to row label 0).
.iloc[0] will always return the first row of the DataFrame, irrespective of what you do to the DataFrame (sort, aggregate, etc.). What exists in this row will depend on the operations you perform. But it will always return that first record.
Based on these guidelines, what do you believe will be returned for .loc[0] and .iloc[0]? Once again, I am not showing the results yet so that you can think about what will be displayed:
Are you ready for the results?
As expected, the .loc[0] returned that same customer X1000. And since the very first record of this new sort is customer X1010, that is the record shown when calling .iloc[0].
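Here is a sketch of this second sort so you can verify the result. The tutorial does not name the new DataFrame, so sales3 is my own choice of name.

```python
import pandas as pd

data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223],
}
sales1 = pd.DataFrame(data, index=[0, 1, 2, 3, 4, 5])
sales2 = sales1.sort_values(by='CustomerID', ascending=False)

# Assumed name: sales3 is sales2 re-sorted by Sales, highest first.
sales3 = sales2.sort_values(by='Sales', ascending=False)

customer_at_label_0 = sales3.loc[0, 'CustomerID']      # label lookup
customer_in_first_row = sales3.iloc[0]['CustomerID']   # position lookup
label_order = sales3.index.tolist()

print(customer_at_label_0, customer_in_first_row)
print(label_order)
```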
Next Steps
Hopefully, you got this on the first try. I think this tutorial should give you a leg up on how these two concepts are supposed to work (loc and iloc). But if you didn't catch it this time, please feel free to go through this tutorial a few times until you do get it.
But if you feel comfortable, try guessing what would happen if you tried different indexes like .loc[1] and .iloc[3]. Do they return what you think they should?
You can also try to add text-based labels or even better, make the CustomerID the index. In fact, let's do that, shall we?
Use set_index("CustomerID") to set the index to the CustomerID. Remember that set_index returns a new DataFrame, so assign the result to a variable. Once again, see if you can guess the result of each of the different .loc and .iloc operations. You should be able to nail this by now!
Let's start with the .loc[0]. Will this work?
If you did not guess that this operation would fail, you may want to run through this tutorial from the start. But I am guessing that you picked right up on it.
The rest of the items will not bomb out with an error (to give you a hint!)
Hopefully, you nailed these as well!
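If you want something to check your answers against, here is one possible sketch. I am assuming we start from sales1, and by_customer is a name I made up for the re-indexed DataFrame. The specific lookups are examples of my own, not necessarily the ones from the original exercise.

```python
import pandas as pd

data = {
    'CustomerID': ['X1000', 'X1010', 'X1020', 'X1030', 'X1040', 'X1050'],
    'Name': ['John', 'Ann', 'Joe', 'Alice', 'Susan', 'Bill'],
    'Age': [30, 19, 25, 53, 38, 68],
    'Region': ['North', 'North', 'South', 'East', 'South', 'West'],
    'Income Strata': ['High', 'Medium', 'Medium', 'Low', 'Medium', 'High'],
    'Sales': [250, 5000, 132, 400, 780, 223],
}
sales1 = pd.DataFrame(data, index=[0, 1, 2, 3, 4, 5])

# Assumption: start from sales1; set_index returns a new DataFrame.
by_customer = sales1.set_index('CustomerID')

try:
    by_customer.loc[0]           # the labels are now strings like 'X1000'
    zero_is_a_label = True
except KeyError:
    zero_is_a_label = False

name_by_label = by_customer.loc['X1000', 'Name']    # label lookup with a text label
name_by_position = by_customer.iloc[0]['Name']      # the first row still works
last_customer = by_customer.index[-1]               # label of the last row

print(zero_is_a_label)
print(name_by_label, name_by_position, last_customer)
```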
When you grasp these concepts, you open up a new world for your data analysis. There is more to these commands than is covered here. But when you get your Aha! moment, those other concepts will fall into place much quicker. They certainly did for me!