Issue
The operation pandas.DataFrame.lookup
is "Deprecated since version 1.2.0", and has since invalidated a lot of previous answers.
This post attempts to function as a canonical resource for looking up corresponding row col pairs in pandas versions 1.2.0 and newer.
Some previous answers to this type of question (now deprecated):
- Vectorized lookup on a pandas dataframe
- Python Pandas Match Vlookup columns based on header values
- Using DataFrame.lookup to get rows where columns names are a subset of a string
- Python: pandas: match row value to column name/ key's value
Some Current Answers to this Question:
- Reference DataFrame value corresponding to column header
- Pandas/Python: How to create new column based on values from other columns and apply extra condition to this new column
Standard LookUp Values With Default Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 B 4 8
I would like to be able to lookup the corresponding value in the column specified in Col
:
I would like my result to look like:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Standard LookUp Values With a Non-Default Index
Non-Contiguous Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
Col A B
0 B 1 5
2 A 2 6
8 A 3 7
9 B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
MultiIndex
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
Col A B
C E B 1 5
F A 2 6
D E A 3 7
F B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
LookUp with Default For Unmatched/Not-Found Values
Given the following DataFrame
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 C 4 8 # Column C does not correspond with any column
I would like to look up the corresponding values if one exists otherwise I'd like to have it default to 0
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0 # Default value 0 since C does not correspond
LookUp with Missing Values in the lookup Col
Given the following DataFrame:
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 NaN 4 8 # <- Missing Lookup Key
I would like any NaN
values in Col
to result in a NaN
value in Val
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # NaN to indicate missing
Solution
Standard LookUp Values With Any Index
The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize
and reindex
as the replacement for the deprecated DataFrame.lookup
.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
factorize
is used to convert the column encode the values as an "enumerated type".
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
Notice that B
corresponds to 0
and A
corresponds to 1
. reindex
is used to ensure that columns appear in the same order as the enumeration:
df.reindex(columns=col)
B A # B appears First (location 0) A appers second (location 1)
0 5 1
1 6 2
2 7 3
3 8 4
We need to create an appropriate range indexer compatible with NumPy indexing.
The standard approach is to use np.arange
based on the length of the DataFrame:
np.arange(len(df))
[0 1 2 3]
Now NumPy indexing will work to select values from the DataFrame:
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
[5 2 3 8]
*Note: This approach will always work regardless of type of index.
MultiIndex
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
Why use np.arange
and not df.index
directly?
Standard Contiguous Range Index
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
In this case only, there is no error as the result from np.arange
is the same as the df.index
.
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Non-Contiguous Range Index Error
Raises IndexError:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: index 8 is out of bounds for axis 0 with size 4
MultiIndex Error
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
Raises IndexError:
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
LookUp with Default For Unmatched/Not-Found Values
There are a few approaches.
First let's look at what happens by default if there is a non-corresponding value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 C 4 8
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 C 4 8 NaN # NaN Represents the Missing Value in C
If we look at why the NaN
values are introduced, we will find that when factorize
goes through the column it will enumerate all groups present regardless of whether they correspond to a column or not.
For this reason, when we reindex
the DataFrame we will end up with the following result:
idx, col = pd.factorize(df['Col'])
df.reindex(columns=col)
idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
B A C
0 5 1 NaN
1 6 2 NaN
2 7 3 NaN
3 8 4 NaN # Reindex adds the missing column with the Default `NaN`
If we want to specify a default value, we can specify the fill_value
argument of reindex
which allows us to modify the behaviour as it relates to missing column values:
idx, col = pd.factorize(df['Col'])
df.reindex(columns=col, fill_value=0)
idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
B A C
0 5 1 0
1 6 2 0
2 7 3 0
3 8 4 0 # Notice reindex adds missing column with specified value `0`
This means that we can do:
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
columns=col,
fill_value=0 # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]
df
:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0
*Notice the dtype
of the column is int
, since NaN
was never introduced, and, therefore, the column type was not changed.
LookUp with Missing Values in the lookup Col
factorize
has a default na_sentinel=-1
, meaning that when NaN
values appear in the column being factorized the resulting idx
value is -1
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 NaN 4 8 # <- Missing Lookup Key
idx, col = pd.factorize(df['Col'])
# idx = array([ 0, 1, 1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
# Col A B Val
# 0 B 1 5 5
# 1 A 2 6 2
# 2 A 3 7 3
# 3 NaN 4 8 4 <- Value From A
This -1
means that, by default, we'll be pulling from the last column when we reindex. Notice the col
still only contains the values B
and A
. Meaning, that we will end up with the value from A
in Val
for the last row.
The easiest way to handle this is to fillna
Col
with some value that cannot be found in the column headers.
Here I use the empty string ''
:
idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')
Now when I reindex, the ''
column will contain NaN
values meaning that the lookup produces the desired result:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
:
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # Missing as expected
Answered By - Henry Ecker
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.