Issue
Suppose I have a pandas.DataFrame with two categorical columns: Card, with two possible values "Visa" and "Master", and Success, which takes the value 0 or 1. Example:
Index | Card | Success |
---|---|---|
1 | Visa | 0 |
_ | Master | 1 |
2 | Visa | 1 |
_ | Master | 0 |
3 | Visa | 1 |
4 | Master | 1 |
I want to transform it into a dataframe such that Card and Success attributes are joined and the following dataframe is produced:
Index | Card_Visa_Success | Card_Master_Success |
---|---|---|
1 | 0 | 1 |
2 | 1 | 0 |
3 | 1 | 0 |
4 | 0 | 1 |
(default value of success is 0 if the corresponding row does not exist)
One way I could think of doing this is to one-hot encode the Card attribute alone and then take an AND with the Success column (df['Card_Visa_Success'] = df['Card_Visa'] & df['Success']), which keeps 1 only if both values are 1. Is there a simpler way of combining the two columns using some pandas or sklearn function? In addition, could you also help me write a custom transformer for this operation so that I can integrate it into my sklearn pipeline? Thanks!
Solution
If you have a multi-indexed pd.DataFrame as shown in the input data, I think you can get the result by using unstack to pull one index level from the row index into the column index.
As the unstack operation gives us a multi-index for the columns, we need to merge the multi-index into a regular one using a map operation.
I assume that if there is no content in the DataFrame for a specific card-index combination, it is not considered "successful", so we fill the missing values with 0s.
import pandas as pd

data = {
    'Index': [1, 1, 2, 2, 3, 4],
    'Card': ['Visa', 'Master', 'Visa', 'Master', 'Visa', 'Master'],
    'Success': [0, 1, 1, 0, 1, 1]
}
df = pd.DataFrame(data).set_index(['Index', 'Card']).unstack().fillna(0.0)
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
print(df)
       Success_Master  Success_Visa
Index
1                 1.0           0.0
2                 0.0           1.0
3                 0.0           1.0
4                 1.0           0.0
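The float values above come from filling the gaps after unstacking. As a minimal sketch of a variant (not part of the original answer), selecting the Success column before unstacking and passing fill_value=0 to unstack should avoid both the column multi-index and the float conversion. The names flat and wide, and the Card_<name>_Success naming pattern taken from the question, are choices made here for illustration.

```python
import pandas as pd

data = {
    "Index": [1, 1, 2, 2, 3, 4],
    "Card": ["Visa", "Master", "Visa", "Master", "Visa", "Master"],
    "Success": [0, 1, 1, 0, 1, 1],
}
flat = pd.DataFrame(data)

# Unstack a single Series: the columns become plain card names,
# and missing (Index, Card) pairs are filled with 0 directly.
wide = flat.set_index(["Index", "Card"])["Success"].unstack(fill_value=0)

# Rename to the column pattern requested in the question.
wide.columns = [f"Card_{card}_Success" for card in wide.columns]
print(wide)
```

The columns come out in sorted order of the card names, so Card_Master_Success precedes Card_Visa_Success.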
To fit this into an sklearn pipeline, you just need to implement a transform method (plus a fit that returns self) on a transformer:
from sklearn.base import BaseEstimator, TransformerMixin

class CardSuccessTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_transformed = X.copy()
        X_transformed = X_transformed.set_index(['Index', 'Card']).unstack().fillna(0.0)
        X_transformed.columns = X_transformed.columns.map('{0[0]}_{0[1]}'.format)
        return X_transformed
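You can find more information about writing custom transformers and using them in a pipeline in the scikit-learn documentation. The transformer is then used like this:
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        ("card_success_transformer", CardSuccessTransformer())
    ]
)
# The transformer expects the original flat frame with 'Index' and 'Card'
# columns, not the already-unstacked df from above.
transformed_df = pipe.fit_transform(pd.DataFrame(data))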
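One thing to watch in a pipeline is that the output columns depend on which cards happen to be present in the data passed to transform. The following is a rough sketch of one way to handle that, assuming it is acceptable to learn the card names during fit. The class name PivotFlagTransformer, its parameters index_col, category_col and value_col, and the categories_ attribute are all hypothetical names introduced for this example, not part of the original answer or of scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class PivotFlagTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: configurable column names, categories learned in fit."""

    def __init__(self, index_col="Index", category_col="Card", value_col="Success"):
        self.index_col = index_col
        self.category_col = category_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Remember the categories seen during fit so that transform
        # always returns the same columns in the same order.
        self.categories_ = sorted(X[self.category_col].unique())
        return self

    def transform(self, X):
        wide = (
            X.set_index([self.index_col, self.category_col])[self.value_col]
            .unstack(fill_value=0)
            # Add columns for categories missing from X, filled with 0.
            .reindex(columns=self.categories_, fill_value=0)
        )
        wide.columns = [
            f"{self.category_col}_{cat}_{self.value_col}" for cat in wide.columns
        ]
        return wide


train = pd.DataFrame({
    "Index": [1, 1, 2, 2, 3, 4],
    "Card": ["Visa", "Master", "Visa", "Master", "Visa", "Master"],
    "Success": [0, 1, 1, 0, 1, 1],
})
pipe = Pipeline(steps=[("pivot_flags", PivotFlagTransformer())])
out_train = pipe.fit_transform(train)
print(out_train)

# New data that contains only Visa rows still gets a Master column.
visa_only = pd.DataFrame({"Index": [5, 6], "Card": ["Visa", "Visa"], "Success": [1, 0]})
out_new = pipe.transform(visa_only)
print(out_new)
```

Note that this transform returns one row per Index value rather than one row per input row, so any target passed alongside it would need to be aligned to that index.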
Answered By - Sebastian Wozny