Issue
I would like to know how to add extra features when I use a classifier. For example:
random_forest_bow = Pipeline([
        ('rf_tfidf', Feat_Selection.countV),
        ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
])
random_forest_bow.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_bow = random_forest_bow.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_bow == DataPrep.test_news['Label'])
I am also considering other features in the model. I defined X and y as follows:
X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)
countV = CountVectorizer()
train_count = countV.fit_transform(df_train['Text'].values)
My dataset looks as follows:
Text                              is_it_capital?  is_it_upper?  contains_num?  Label
an example of text                0               0             0              0
ANOTHER example of text           1               1             0              1
What's happening?Let's talk at 5  1               0             1              1
I would like to also use is_it_capital?, is_it_upper?, and contains_num? as features, but since they already have binary values (1 or 0, after encoding), I should apply BoW only to Text to extract the extra features.
Maybe my question is obvious, but since I am new to ML and not yet familiar with classifiers and encoding, I would be thankful for any support and comments you can provide. Thanks
Solution
You can certainly use your "extra" features like is_it_capital?, is_it_upper?, and contains_num?. It seems you're struggling with how exactly to combine the two seemingly disparate feature sets. You could use something like sklearn.pipeline.FeatureUnion or sklearn.compose.ColumnTransformer to apply a different encoding strategy to each set of features. There's no reason you couldn't combine your extra features with whatever a text-feature extraction method (e.g. your BoW approach) produces.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text',
             'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0]
})

transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)
print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
[0 0 0 1 1 0 1 1 1 1 0]
[1 0 2 0 0 0 1 1 0 0 1]
[0 1 0 0 0 1 0 0 0 0 0]]
print(transformer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']
More on your specific example:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Need to use DenseTransformer to properly concatenate results
# from CountVectorizer and other transformer steps
from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, y=None, **fit_params):
        return X.todense()

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])

transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# fit_transform returns an array, not a DataFrame, so wrap it (reusing the
# labels' index) before concatenating with y
df_train = pd.concat([pd.DataFrame(X_train, index=y_train.index), y_train], axis=1)
df_test = pd.concat([pd.DataFrame(X_test, index=y_test.index), y_test], axis=1)
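Tying this back to your original random-forest pipeline: instead of transforming X up front, you can also put the ColumnTransformer inside the Pipeline as its first step, so fit and predict work directly on the DataFrame. A minimal sketch with a toy stand-in for your df (the data values here are illustrative, not from your dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for df; your real frame has the same columns
df = pd.DataFrame({
    'Text': ['an example of text', 'ANOTHER example of text',
             "What's happening?Let's talk at 5", 'one more line of text'],
    'is_it_capital?': [0, 1, 1, 0],
    'is_it_upper?': [0, 1, 0, 0],
    'contains_num?': [0, 0, 1, 0],
})
y = pd.Series([0, 1, 1, 0], name='Label')

# BoW on 'Text', binary columns passed through, then the classifier
random_forest_bow = Pipeline([
    ('features', ColumnTransformer([('text', CountVectorizer(), 'Text')],
                                   remainder='passthrough')),
    ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3)),
])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=40)
random_forest_bow.fit(X_train, y_train)
predicted = random_forest_bow.predict(X_test)
```

Keeping the transformer inside the pipeline also means the BoW vocabulary is fit only on the training split, which avoids leaking test data into the vectorizer.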
Answered By - blacksite