Issue
Running this code:
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'key': ['key1', 'key2', 'key3', 'key1', 'key2'],
    'colA': ['value1A', 'value2A', 'value3A', 'value4A', 'value5A'],
    'colB': ['value1B', 'value2B', 'value3B', 'value4B', 'value5B'],
    'colC': ['value1C', 'value2C', 'value3C', 'value4C', 'value5C'],
    'colD': ['value1D', 'value2D', 'value3D', 'value4D', 'value5D'],
}
df = pd.DataFrame(data)

# Create conditions
conditions = [df['key'] == 'key1',
              df['key'] == 'key2',
              df['key'] == 'key3']

# Apply conditions and choices to the respective columns
df['colA'] = np.select([df['key'] == 'key1'], [df['colA']], default='NA')
df['colD'] = np.select([df['key'] == 'key1'], [df['colD']], default='NA')
df['colB'] = np.select([df['key'] == 'key2'], [df['colB']], default='NA')
df['colC'] = np.select([df['key'] == 'key3'], [df['colC']], default='NA')

# Display the resulting DataFrame
print(df)
Produces this:
key colA colB colC colD
0 key1 value1A NA NA value1D
1 key2 NA value2B NA NA
2 key3 NA NA value3C NA
3 key1 value4A NA NA value4D
4 key2 NA value5B NA NA
Is there a way to rewrite this efficiently so I don't have to call numpy.select for each column I want to map? The "key" column essentially determines which columns contain valid data for that particular row: if the data is invalid I want to mark it as NA, and if it is valid I want to keep the value that's in the row. In my real dataset, a single key controls the mapping of more than one column, just as key1 controls the mapping to colA and colD above. I'd prefer numpy or other vectorized methods, since from what I understand they are faster than alternatives such as map, but I'm open to hearing any and all ideas.
Solution
You could build the mask in a vectorized way from a dictionary of key-to-columns mappings with get_dummies, then use it to mask the unwanted values:
# map each key to the columns that hold valid data for it
d = {'key1': ['colA', 'colD'],
     'key2': ['colB'],
     'key3': ['colC'],
     }

s = pd.Series(d).explode()                   # one row per (key, column) pair
mask = pd.get_dummies(s, dtype=bool).groupby(level=0).max()  # boolean key-by-column table

cols = df.columns.difference(['key'])        # every column except "key"
# align the mask on each row's key, then blank out the invalid cells
df[cols] = df[cols].where(mask.reindex(df['key']).to_numpy(), 'NA')
Output:
key colA colB colC colD
0 key1 value1A NA NA value1D
1 key2 NA value2B NA NA
2 key3 NA NA value3C NA
3 key1 value4A NA NA value4D
4 key2 NA value5B NA NA
Intermediates:
# mask
colA colB colC colD
key1 True False False True
key2 False True False False
key3 False False True False
# mask.reindex(df['key']).to_numpy()
[[ True False False True]
[False True False False]
[False False True False]
[ True False False True]
[False True False False]]
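For the record, the same boolean mask could also be built with pd.crosstab instead of get_dummies/groupby. A minimal sketch, assuming the same d dictionary as above:

s = pd.Series(d).explode()
# cross-tabulate keys against columns; the 0/1 counts become booleans
mask = pd.crosstab(s.index, s).astype(bool).rename_axis(index=None, columns=None)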
Another approach could be to reshape your data and filter it with merge:
d = {'key1': ['colA', 'colD'],
     'key2': ['colB'],
     'key3': ['colC'],
     }

(df.reset_index()
   .melt(['index', 'key'])                  # long format: one row per cell
   .merge(pd.Series(d).explode()            # inner merge keeps only valid (key, column) pairs
            .rename_axis('key')
            .reset_index(name='variable'))
   .set_index(['index', 'key', 'variable'])['value']
   .unstack('variable', fill_value='NA')    # back to wide format, padding invalid cells
   .reset_index('key').rename_axis(index=None, columns=None)
)
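One side note on the design: 'NA' here is a literal string, so the masked columns stay object dtype. If you want genuine missing values that isna()/dropna() recognize, you could pass pd.NA (or np.nan) as the replacement instead. A sketch, reusing mask and cols from the first approach:

# same masking step, but with a real missing-value marker instead of the string 'NA'
df[cols] = df[cols].where(mask.reindex(df['key']).to_numpy(), pd.NA)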
Answered By - mozway