Saturday, May 21, 2022

[FIXED] Creating dummy variables using Scikit-Learn's feature_extraction


Issue

I would like to create dummy variables from a column using scikit-learn.

My data looks like this:

INDICATOR MATCHUP 
1         [   "APPLE",   "GRAPE" ]
1         [   "APPLE",   "GRAPE" ]
0         [   "GRAPE",   "BANANA" ]
0         [   "PEAR",   "ORANGE" ]
1         [   "ORANGE",   "APPLE" ]

Dictionary of the data is as follows:

{'INDICATOR': [1, 1, 0, 0, 1],
 'MATCHUP': ['[   "APPLE",   "GRAPE" ]',
  '[   "APPLE",   "GRAPE" ]',
  '[   "GRAPE",   "BANANA" ]',
  '[   "PEAR",   "ORANGE" ]',
  '[   "ORANGE",   "APPLE" ]']}

I want to use scikit-learn's TfidfVectorizer for this. I need to use this package because of the pipeline I am building.

Final Outcome:

INDICATOR MATCHUP                    APPLE GRAPE BANANA PEAR ORANGE
1         [   "APPLE",   "GRAPE" ]   1     1     0      0    0 
1         [   "APPLE",   "GRAPE" ]   1     1     0      0    0
0         [   "GRAPE",   "BANANA" ]  0     1     1      0    0
0         [   "PEAR",   "ORANGE" ]   0     0     0      1    1
1         [   "ORANGE",   "APPLE" ]  1     0     0      0    1

I managed to do this without scikit-learn (see below), but I now need to do it with scikit-learn.

import ast

df.join(df['MATCHUP'].map(ast.literal_eval).explode().str.get_dummies().groupby(level=0).sum())

I cannot use this approach because the step will eventually go into a ColumnTransformer, so a scikit-learn solution would be appreciated.

Solution

Using one of scikit-learn's built-in text vectorizers is much easier than the manual method you built yourself. With the default settings, the tokenizer lowercases the text and keeps only runs of two or more word characters (letters, digits, underscore), so the brackets, quotes, and commas are ignored. This automatically takes care of the fact that your column of lists is actually a column of strings that look like lists. I will also show the difference between the TfidfVectorizer that you say you must use and the CountVectorizer, which produces the output you asked for.
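To see what the tokenizer does with one of these list-like strings, here is a minimal sketch that calls the vectorizer's default analyzer directly. The variable names `analyzer` and `tokens` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable the vectorizer uses internally:
# it lowercases the text and extracts tokens of two or more word characters.
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer('[   "APPLE",   "GRAPE" ]')
print(tokens)  # ['apple', 'grape']
```

One caveat: because tokens are split on anything that is not a word character, a multi-word item such as "GREEN APPLE" would become two separate tokens with these defaults.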

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer as cv
from sklearn.feature_extraction.text import TfidfVectorizer as tv

df=pd.DataFrame({'INDICATOR': [1, 1, 0, 0, 1],
 'MATCHUP': ['[   "APPLE",   "GRAPE" ]',
  '[   "APPLE",   "GRAPE" ]',
  '[   "GRAPE",   "BANANA" ]',
  '[   "PEAR",   "ORANGE" ]',
  '[   "ORANGE",   "APPLE" ]']}).set_index('INDICATOR')

df

    MATCHUP
INDICATOR   
1   [ "APPLE", "GRAPE" ]
1   [ "APPLE", "GRAPE" ]
0   [ "GRAPE", "BANANA" ]
0   [ "PEAR", "ORANGE" ]
1   [ "ORANGE", "APPLE" ]

Note the distinction between the output you showed and what these vectorizers normally return, namely a SciPy sparse matrix. Most scikit-learn estimators accept either sparse or dense input, so the format rarely matters downstream, but I will show both the native output and the DataFrame version.
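As a small illustration of the sparse versus dense distinction, the sketch below fits a vectorizer on a two-row sample (the `matchups` Series is made up for this example) and converts the result to a regular NumPy array.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative rows in the same string format as the MATCHUP column
matchups = pd.Series(['[ "APPLE", "GRAPE" ]', '[ "GRAPE", "BANANA" ]'])

X = CountVectorizer().fit_transform(matchups)
print(sparse.issparse(X))  # True
print(X.shape)             # (2, 3)

# toarray() gives a dense ndarray; columns are apple, banana, grape
dense = X.toarray()
print(dense.tolist())      # [[1, 0, 1], [0, 1, 1]]
```

For a vocabulary this small the dense form is harmless; with thousands of distinct tokens it is usually better to keep the sparse matrix.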

TL;DR: Initialize a vectorizer, then use it to `.fit_transform(df['MATCHUP'])`

#first we apply the CountVectorizer to get the desired binary output
tf=cv()
#we will print the human-friendly version of the sparse matrix for comparison
print(tf.fit_transform(df['MATCHUP']))

  (0, 0)    1
  (0, 2)    1
  (1, 0)    1
  (1, 2)    1
  (2, 2)    1
  (2, 1)    1
  (3, 4)    1
  (3, 3)    1
  (4, 0)    1
  (4, 3)    1

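#then we also convert to dense format and make a dataframe to show how it looks
#(get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
count_df=pd.DataFrame(tf.fit_transform(df['MATCHUP']).todense(), columns=tf.get_feature_names_out())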
print(count_df)

    apple   banana  grape   orange  pear
0   1       0       1       0       0
1   1       0       1       0       0
2   0       1       1       0       0
3   0       0       0       1       1
4   1       0       0       1       0

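Then we can do the same with the TfidfVectorizer to show the difference in output: here each count is weighted by the term's inverse document frequency (idf) and each row is then normalized to unit length. Note that the non-zero entries are in the same positions as before.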

tf=tv()
print(tf.fit_transform(df['MATCHUP']))

  (0, 2)    0.7071067811865476
  (0, 0)    0.7071067811865476
  (1, 2)    0.7071067811865476
  (1, 0)    0.7071067811865476
  (2, 1)    0.830880748357988
  (2, 2)    0.5564505207186616
  (3, 3)    0.6279137616509933
  (3, 4)    0.7782829228046183
  (4, 3)    0.7694470729725092
  (4, 0)    0.6387105775654869

#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tfidf_df=pd.DataFrame(tf.fit_transform(df['MATCHUP']).todense(), columns=tf.get_feature_names_out())
print(tfidf_df)

    apple       banana      grape       orange      pear
0   0.707107    0.000000    0.707107    0.000000    0.000000
1   0.707107    0.000000    0.707107    0.000000    0.000000
2   0.000000    0.830881    0.556451    0.000000    0.000000
3   0.000000    0.000000    0.000000    0.627914    0.778283
4   0.638711    0.000000    0.000000    0.769447    0.000000
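To show where numbers such as 0.830881 come from, here is a rough sketch that reproduces the table by hand. It assumes the TfidfVectorizer defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); the helper names `df_t`, `weighted`, `manual`, and `reference` are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['[   "APPLE",   "GRAPE" ]',
        '[   "APPLE",   "GRAPE" ]',
        '[   "GRAPE",   "BANANA" ]',
        '[   "PEAR",   "ORANGE" ]',
        '[   "ORANGE",   "APPLE" ]']

# Raw term counts; columns are apple, banana, grape, orange, pear
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many rows each term appears
df_t = (counts > 0).sum(axis=0)

# Smoothed idf, as used by default: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df_t)) + 1

# Weight the counts by idf, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))  # True
```

In row 2, for instance, "banana" appears in only one document and "grape" in three, so banana gets the larger idf and therefore the larger share of the normalized row.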

Then, to complete the view that matches the requested output, we join the result back to the original data. (Note that this step is wholly unnecessary if you are using the transformers within a ColumnTransformer or Pipeline.)

print(pd.concat([df.reset_index(),count_df], axis=1))

    INDICATOR   MATCHUP         apple   banana  grape   orange  pear
0   1       [ "APPLE", "GRAPE" ]    1       0   1       0       0
1   1       [ "APPLE", "GRAPE" ]    1       0   1       0       0
2   0       [ "GRAPE", "BANANA" ]   0       1   1       0       0
3   0       [ "PEAR", "ORANGE" ]    0       0   0       1       1
4   1       [ "ORANGE", "APPLE" ]   1       0   0       1       0

print(pd.concat([df.reset_index(),tfidf_df], axis=1))

    INDICATOR   MATCHUP         apple       banana      grape       orange      pear
0   1   [ "APPLE", "GRAPE" ]    0.707107    0.000000    0.707107    0.000000    0.000000
1   1   [ "APPLE", "GRAPE" ]    0.707107    0.000000    0.707107    0.000000    0.000000
2   0   [ "GRAPE", "BANANA" ]   0.000000    0.830881    0.556451    0.000000    0.000000
3   0   [ "PEAR", "ORANGE" ]    0.000000    0.000000    0.000000    0.627914    0.778283
4   1   [ "ORANGE", "APPLE" ]   0.638711    0.000000    0.000000    0.769447    0.000000

Answered By - G. Anderson

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
