Monday, June 6, 2022

[FIXED] Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer

June 06, 2022 dataframe, pandas, python No comments

Issue

The operation pandas.DataFrame.lookup is "Deprecated since version 1.2.0", and has since invalidated a lot of previous answers.

This post attempts to function as a canonical resource for looking up corresponding row col pairs in pandas versions 1.2.0 and newer.

Some previous answers to this type of question (now deprecated):

Some Current Answers to this Question:

Standard LookUp Values With Default Range Index

Given the following DataFrame:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
  Col  A  B
0   B  1  5
1   A  2  6
2   A  3  7
3   B  4  8

I would like to be able to lookup the corresponding value in the column specified in Col:

I would like my result to look like:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

Standard LookUp Values With a Non-Default Index

Non-Contiguous Range Index

Given the following DataFrame:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]}, 
                  index=[0, 2, 8, 9])

  Col  A  B
0   B  1  5
2   A  2  6
8   A  3  7
9   B  4  8

I would like to preserve the index but still find the correct corresponding Value:

  Col  A  B  Val
0   B  1  5    5
2   A  2  6    2
8   A  3  7    3
9   B  4  8    8

MultiIndex

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

    Col  A  B
C E   B  1  5
  F   A  2  6
D E   A  3  7
  F   B  4  8

I would like to preserve the index but still find the correct corresponding Value:

    Col  A  B  Val
C E   B  1  5    5
  F   A  2  6    2
D E   A  3  7    3
  F   B  4  8    8

LookUp with Default For Unmatched/Not-Found Values

Given the following DataFrame

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

  Col  A  B
0   B  1  5
1   A  2  6
2   A  3  7
3   C  4  8  # Column C does not correspond with any column

I would like to look up the corresponding values if one exists otherwise I'd like to have it default to 0

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   C  4  8    0  # Default value 0 since C does not correspond

LookUp with Missing Values in the lookup Col

Given the following DataFrame:

   Col  A  B
0    B  1  5
1    A  2  6
2    A  3  7
3  NaN  4  8  # <- Missing Lookup Key

I would like any NaN values in Col to result in a NaN value in Val

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    A  3  7  3.0
3  NaN  4  8  NaN  # NaN to indicate missing

Solution

Standard LookUp Values With Any Index

The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

factorize is used to convert the column encode the values as an "enumerated type".

idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')

Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that columns appear in the same order as the enumeration:

df.reindex(columns=col)

   B  A  # B appears First (location 0) A appers second (location 1)
0  5  1
1  6  2
2  7  3
3  8  4

We need to create an appropriate range indexer compatible with NumPy indexing.

The standard approach is to use np.arange based on the length of the DataFrame:

np.arange(len(df))

[0 1 2 3]

Now NumPy indexing will work to select values from the DataFrame:

df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

[5 2 3 8]

*Note: This approach will always work regardless of type of index.

MultiIndex

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

    Col  A  B  Val
C E   B  1  5    5
  F   A  2  6    2
D E   A  3  7    3
  F   B  4  8    8

Why use `np.arange` and not `df.index` directly?

Standard Contiguous Range Index

import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

In this case only, there is no error as the result from np.arange is the same as the df.index. df

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

Non-Contiguous Range Index Error

Raises IndexError:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: index 8 is out of bounds for axis 0 with size 4

MultiIndex Error

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

Raises IndexError:

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

LookUp with Default For Unmatched/Not-Found Values

There are a few approaches.

First let's look at what happens by default if there is a non-corresponding value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
#   Col  A  B
# 0   B  1  5
# 1   A  2  6
# 2   A  3  7
# 3   C  4  8

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

  Col  A  B  Val
0   B  1  5  5.0
1   A  2  6  2.0
2   A  3  7  3.0
3   C  4  8  NaN  # NaN Represents the Missing Value in C

If we look at why the NaN values are introduced, we will find that when factorize goes through the column it will enumerate all groups present regardless of whether they correspond to a column or not.

For this reason, when we reindex the DataFrame we will end up with the following result:

idx, col = pd.factorize(df['Col'])
df.reindex(columns=col)

idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
   B  A   C
0  5  1 NaN
1  6  2 NaN
2  7  3 NaN
3  8  4 NaN  # Reindex adds the missing column with the Default `NaN`

If we want to specify a default value, we can specify the fill_value argument of reindex which allows us to modify the behaviour as it relates to missing column values:

idx, col = pd.factorize(df['Col'])
df.reindex(columns=col, fill_value=0)

idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
   B  A  C
0  5  1  0
1  6  2  0
2  7  3  0
3  8  4  0  # Notice reindex adds missing column with specified value `0`

This means that we can do:

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
    columns=col, 
    fill_value=0  # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]

df:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   C  4  8    0

*Notice the dtype of the column is int, since NaN was never introduced, and, therefore, the column type was not changed.

LookUp with Missing Values in the lookup Col

factorize has a default na_sentinel=-1, meaning that when NaN values appear in the column being factorized the resulting idx value is -1

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
#    Col  A  B
# 0    B  1  5
# 1    A  2  6
# 2    A  3  7
# 3  NaN  4  8  # <- Missing Lookup Key

idx, col = pd.factorize(df['Col'])
# idx = array([ 0,  1,  1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
#    Col  A  B  Val
# 0    B  1  5    5
# 1    A  2  6    2
# 2    A  3  7    3
# 3  NaN  4  8    4 <- Value From A

This -1 means that, by default, we'll be pulling from the last column when we reindex. Notice the col still only contains the values B and A. Meaning, that we will end up with the value from A in Val for the last row.

The easiest way to handle this is to fillna Col with some value that cannot be found in the column headers.

Here I use the empty string '':

idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')

Now when I reindex, the '' column will contain NaN values meaning that the lookup produces the desired result:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    A  3  7  3.0
3  NaN  4  8  NaN  # Missing as expected

Answered By - Henry Ecker

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, June 6, 2022

[FIXED] Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer

Issue

Standard LookUp Values With Default Range Index

Standard LookUp Values With a Non-Default Index

Non-Contiguous Range Index

MultiIndex

LookUp with Default For Unmatched/Not-Found Values

LookUp with Missing Values in the lookup Col

Solution

Standard LookUp Values With Any Index

MultiIndex

Why use `np.arange` and not `df.index` directly?

Standard Contiguous Range Index

Non-Contiguous Range Index Error

MultiIndex Error

LookUp with Default For Unmatched/Not-Found Values

LookUp with Missing Values in the lookup Col

0 comments:

Post a Comment

Popular Posts

Labels

Monday, June 6, 2022

Issue

Standard LookUp Values With Default Range Index

Standard LookUp Values With a Non-Default Index

Non-Contiguous Range Index

MultiIndex

LookUp with Default For Unmatched/Not-Found Values

LookUp with Missing Values in the lookup Col

Solution

Standard LookUp Values With Any Index

MultiIndex

Why use np.arange and not df.index directly?

Standard Contiguous Range Index

Non-Contiguous Range Index Error

MultiIndex Error

LookUp with Default For Unmatched/Not-Found Values

LookUp with Missing Values in the lookup Col

0 comments:

Post a Comment

Popular Posts

Labels

Why use `np.arange` and not `df.index` directly?