Tuesday, December 5, 2023

[FIXED] Why use loc in Pandas?

December 05, 2023 pandas, pandas-loc, python, series No comments

Issue

Why do we use loc for pandas dataframes? it seems the following code with or without using loc both compiles and runs at a similar speed:

%timeit df_user1 = df.loc[df.user_id=='5561']

100 loops, best of 3: 11.9 ms per loop

%timeit df_user1_noloc = df[df.user_id=='5561']

100 loops, best of 3: 12 ms per loop

So why use loc?

Edit: This has been flagged as a duplicate question. But although pandas iloc vs ix vs loc explanation? does mention that

you can do column retrieval just by using the data frame's __getitem__:
df['time']    # equivalent to df.loc[:, 'time']

it does not say why we use loc, although it does explain lots of features of loc. But my specific question is: why not just omit loc altogether? For this question, I have accepted a very detailed answer below.

Also in the above post, the answer (which I do not think is an answer) is really well hidden in the discussion, and any person searching for what I was, would find it hard to locate the information and would be much better served by the answer provided to my question here.

Solution

Explicit is better than implicit.

df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:
```
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]: 
   False  True 
0      3      1
1      4      2
2      5      3
```
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:
```
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
```
Versus using loc:
```
In [231]: df.loc[[True]]
Out[231]: 
   False  True 
0      3      1
```
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as df1 above:
```
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]: 
   A  B
0  1  3
1  2  4
2  3  5

In [259]: df2[['B']]
Out[259]: 
   B
0  3
1  4
2  5
```
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
df.loc[row_indexer, column_index] can select rows and columns. df[indexer] can only select rows or columns depending on the type of values in indexer and the type of column values df has (again, are they boolean?).
```
In [237]: df2.loc[[True,False,True], 'B']
Out[237]: 
0    3
2    5
Name: B, dtype: int64
```
When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
```
In [239]: df2.loc[1:2]
Out[239]: 
   A  B
1  2  4
2  3  5

In [271]: df2[1:2]
Out[271]: 
   A  B
1  2  4
```

Answered By - unutbu

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, December 5, 2023

[FIXED] Why use loc in Pandas?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels