Issue
I would like to create dummy variables from a column using scikit-learn.
My data looks like this:
INDICATOR MATCHUP
1 [ "APPLE", "GRAPE" ]
1 [ "APPLE", "GRAPE" ]
0 [ "GRAPE", "BANANA" ]
0 [ "PEAR", "ORANGE" ]
1 [ "ORANGE", "APPLE" ]
The same data as a dictionary:
{'INDICATOR': [1, 1, 0, 0, 1],
'MATCHUP': ['[ "APPLE", "GRAPE" ]',
'[ "APPLE", "GRAPE" ]',
'[ "GRAPE", "BANANA" ]',
'[ "PEAR", "ORANGE" ]',
'[ "ORANGE", "APPLE" ]']}
I am looking to use scikit-learn's TfidfVectorizer. I need this package because of the pipeline I am building.
Final Outcome:
INDICATOR MATCHUP APPLE GRAPE BANANA PEAR ORANGE
1 [ "APPLE", "GRAPE" ] 1 1 0 0 0
1 [ "APPLE", "GRAPE" ] 1 1 0 0 0
0 [ "GRAPE", "BANANA" ] 0 1 1 0 0
0 [ "PEAR", "ORANGE" ] 0 0 0 1 1
1 [ "ORANGE", "APPLE" ] 1 0 0 0 1
I was able to do this without scikit-learn (see below, which also needs import ast), but I now need to do it with a scikit-learn transformer.
df.join(df['MATCHUP'].map(ast.literal_eval).explode().str.get_dummies().groupby(level=0).sum())
I cannot use this since it will eventually go into a ColumnTransformer, so if we can use SciKit-Learn, I would appreciate it.
Solution
Using one of sklearn's built-in text vectorizers is much easier than the manual method you built yourself. It also copes with the fact that your column of lists is actually a column of strings that look like lists: the default tokenizer treats punctuation as a separator and keeps only tokens of two or more alphanumeric characters. I will also show the difference between the TfidfVectorizer that you say you must use and the CountVectorizer that produces your requested output.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer as cv
from sklearn.feature_extraction.text import TfidfVectorizer as tv
df = pd.DataFrame({'INDICATOR': [1, 1, 0, 0, 1],
                   'MATCHUP': ['[ "APPLE", "GRAPE" ]',
                               '[ "APPLE", "GRAPE" ]',
                               '[ "GRAPE", "BANANA" ]',
                               '[ "PEAR", "ORANGE" ]',
                               '[ "ORANGE", "APPLE" ]']}).set_index('INDICATOR')
df
MATCHUP
INDICATOR
1 [ "APPLE", "GRAPE" ]
1 [ "APPLE", "GRAPE" ]
0 [ "GRAPE", "BANANA" ]
0 [ "PEAR", "ORANGE" ]
1 [ "ORANGE", "APPLE" ]
It is worth noting the difference between the output you asked for and what these vectorizers normally return, namely a sparse matrix. Sklearn estimators generally accept either format, but I will show both the native output and the DataFrame output.
TL;DR: Initialize a vectorizer, then call its .fit_transform(df['MATCHUP'])
#first we apply the CountVectorizer to get the desired 0/1 output
#(these are counts; they are all 0 or 1 here because no fruit repeats within a row)
tf=cv()
#we will print the human-friendly version of the sparse matrix for comparison
print(tf.fit_transform(df['MATCHUP']))
(0, 0) 1
(0, 2) 1
(1, 0) 1
(1, 2) 1
(2, 2) 1
(2, 1) 1
(3, 4) 1
(3, 3) 1
(4, 0) 1
(4, 3) 1
#then we also convert to dense format and make a dataframe to show how it looks
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
count_df=pd.DataFrame(tf.fit_transform(df['MATCHUP']).todense(), columns=tf.get_feature_names_out())
print(count_df)
apple banana grape orange pear
0 1 0 1 0 0
1 1 0 1 0 0
2 0 1 1 0 0
3 0 0 0 1 1
4 1 0 0 1 0
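The column names above come out lowercase and alphabetical because CountVectorizer lowercases tokens and sorts its vocabulary by default. As a minimal sketch, assuming you want the uppercase names and the column order from the question's "Final Outcome" table, the lowercase, binary and vocabulary parameters can be set explicitly. The names matchups, fruit_order and dummies are only illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

matchups = pd.Series([
    '[ "APPLE", "GRAPE" ]',
    '[ "APPLE", "GRAPE" ]',
    '[ "GRAPE", "BANANA" ]',
    '[ "PEAR", "ORANGE" ]',
    '[ "ORANGE", "APPLE" ]',
])

# Column order copied from the "Final Outcome" table in the question.
fruit_order = ['APPLE', 'GRAPE', 'BANANA', 'PEAR', 'ORANGE']

vec = CountVectorizer(
    lowercase=False,         # keep tokens as written instead of lowercasing
    binary=True,             # record presence (0/1) rather than raw counts
    vocabulary=fruit_order,  # fix both the set and the order of the columns
)

dummies = pd.DataFrame(vec.fit_transform(matchups).toarray(), columns=fruit_order)
print(dummies)
```

One trade-off of a fixed vocabulary is that any fruit not listed in it is silently ignored, so leave the vocabulary argument out if new values may appear later.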
Then we can do the same with the TfidfVectorizer to show how the output changes once the count vectors are weighted by the idf and normalized (note that the pattern of nonzero entries is the same):
tf=tv()
print(tf.fit_transform(df['MATCHUP']))
(0, 2) 0.7071067811865476
(0, 0) 0.7071067811865476
(1, 2) 0.7071067811865476
(1, 0) 0.7071067811865476
(2, 1) 0.830880748357988
(2, 2) 0.5564505207186616
(3, 3) 0.6279137616509933
(3, 4) 0.7782829228046183
(4, 3) 0.7694470729725092
(4, 0) 0.6387105775654869
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_df=pd.DataFrame(tf.fit_transform(df['MATCHUP']).todense(), columns=tf.get_feature_names_out())
print(tfidf_df)
apple banana grape orange pear
0 0.707107 0.000000 0.707107 0.000000 0.000000
1 0.707107 0.000000 0.707107 0.000000 0.000000
2 0.000000 0.830881 0.556451 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.627914 0.778283
4 0.638711 0.000000 0.000000 0.769447 0.000000
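To see where these numbers come from, here is a small sketch that rebuilds row 2 (grape, banana) by hand. It assumes the TfidfVectorizer defaults (smooth_idf=True, norm='l2'), under which idf = ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit length. The document frequencies are read off the count table above: banana appears in 1 row and grape in 3.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    '[ "APPLE", "GRAPE" ]',
    '[ "APPLE", "GRAPE" ]',
    '[ "GRAPE", "BANANA" ]',
    '[ "PEAR", "ORANGE" ]',
    '[ "ORANGE", "APPLE" ]',
]
n_docs = len(docs)

# Number of rows containing each term, in the order (banana, grape).
doc_freq = np.array([1, 3])

# Smoothed idf, the default in TfidfVectorizer.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Each term occurs once in row 2, so tf = 1 and the raw weights equal the idf.
# The default norm='l2' then rescales the row to unit Euclidean length.
by_hand = idf / np.linalg.norm(idf)

# Columns are alphabetical: apple, banana, grape, orange, pear.
from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()[2, 1:3]

print(by_hand)       # roughly 0.8309 (banana) and 0.5565 (grape)
print(from_sklearn)
```

The rarer term (banana) gets the larger weight, which is why the tf-idf columns are no longer plain 0/1 dummies.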
Then, to complete the view that matches the requested output, we join the result back to the original. (Note that this step is wholly unnecessary if you are using the transformers inside a ColumnTransformer or Pipeline.)
print(pd.concat([df.reset_index(),count_df], axis=1))
INDICATOR MATCHUP apple banana grape orange pear
0 1 [ "APPLE", "GRAPE" ] 1 0 1 0 0
1 1 [ "APPLE", "GRAPE" ] 1 0 1 0 0
2 0 [ "GRAPE", "BANANA" ] 0 1 1 0 0
3 0 [ "PEAR", "ORANGE" ] 0 0 0 1 1
4 1 [ "ORANGE", "APPLE" ] 1 0 0 1 0
print(pd.concat([df.reset_index(),tfidf_df], axis=1))
INDICATOR MATCHUP apple banana grape orange pear
0 1 [ "APPLE", "GRAPE" ] 0.707107 0.000000 0.707107 0.000000 0.000000
1 1 [ "APPLE", "GRAPE" ] 0.707107 0.000000 0.707107 0.000000 0.000000
2 0 [ "GRAPE", "BANANA" ] 0.000000 0.830881 0.556451 0.000000 0.000000
3 0 [ "PEAR", "ORANGE" ] 0.000000 0.000000 0.000000 0.627914 0.778283
4 1 [ "ORANGE", "APPLE" ] 0.638711 0.000000 0.000000 0.769447 0.000000
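Since the question mentions that this will end up in a ColumnTransformer, here is a minimal sketch of that wiring. It is an illustration under assumptions rather than the asker's actual pipeline: the step name 'matchup' is hypothetical, and sparse_threshold=0 is set only so that the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    'INDICATOR': [1, 1, 0, 0, 1],
    'MATCHUP': ['[ "APPLE", "GRAPE" ]',
                '[ "APPLE", "GRAPE" ]',
                '[ "GRAPE", "BANANA" ]',
                '[ "PEAR", "ORANGE" ]',
                '[ "ORANGE", "APPLE" ]'],
})

# Text vectorizers expect a 1-D sequence of strings, so the column is
# selected with a plain string ('MATCHUP'), not a list (['MATCHUP']).
ct = ColumnTransformer(
    [('matchup', CountVectorizer(), 'MATCHUP')],
    sparse_threshold=0,  # always return a dense array
)

dummies = ct.fit_transform(df)

# The fitted vectorizer is available by its step name; its vocabulary is
# sorted alphabetically, matching the column order of the output.
vocab = sorted(ct.named_transformers_['matchup'].vocabulary_)

print(vocab)
print(dummies)
```

Other columns are dropped by default; passing remainder='passthrough' would keep INDICATOR alongside the dummy columns.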
Answered By - G. Anderson