Issue
I would like to know how to add extra features when I use a classifier. For example:
random_forest_bow = Pipeline([
        ('rf_tfidf', Feat_Selection.countV),
        ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3))
])
random_forest_bow.fit(DataPrep.train['Text'], DataPrep.train['Label'])
predicted_rf_bow = random_forest_bow.predict(DataPrep.test_news['Text'])
np.mean(predicted_rf_bow == DataPrep.test_news['Label'])
I am also considering other features in the model. I defined X and y as follows:
X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)
countV = CountVectorizer()
train_count = countV.fit_transform(df_train['Text'].values)
My dataset looks as follows:
Text                              is_it_capital?  is_it_upper?  contains_num?  Label
an example of text                0               0             0              0
ANOTHER example of text           1               1             0              1
What's happening?Let's talk at 5  1               0             1              1
I would like to also use is_it_capital?, is_it_upper?, and contains_num? as features, but since they already have binary values (1 or 0, after encoding), I should apply BoW only to Text to extract the extra features.
Maybe my question is obvious, but since I am new to ML and not yet familiar with classifiers and encoding, I would be thankful for any support and comments you can provide. Thanks
Solution
You can certainly use your "extra" features like is_it_capital?, is_it_upper?, and contains_num?. It seems you're struggling with how exactly to combine the two seemingly disparate feature sets. You could use something like sklearn.pipeline.FeatureUnion or sklearn.compose.ColumnTransformer to apply a different encoding strategy to each set of features. There's no reason you couldn't combine your extra features with whatever a text-feature extraction method (e.g. your BoW approach) produces.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'text': ['this is some text', 'this is some MORE text',
             'hi hi some text 123', 'bananas oranges'],
    'is_it_upper': [0, 1, 0, 0],
    'contains_num': [0, 0, 1, 0]
})

transformer = ColumnTransformer([('text', CountVectorizer(), 'text')], remainder='passthrough')
X = transformer.fit_transform(df)
print(X)
[[0 0 0 1 0 0 1 1 1 0 0]
[0 0 0 1 1 0 1 1 1 1 0]
[1 0 2 0 0 0 1 1 0 0 1]
[0 1 0 0 0 1 0 0 0 0 0]]
print(transformer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
['text__123', 'text__bananas', 'text__hi', 'text__is', 'text__more', 'text__oranges', 'text__some', 'text__text', 'text__this', 'is_it_upper', 'contains_num']
More on your specific example:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

# Need to use DenseTransformer to properly concatenate results
# from CountVectorizer and other transformer steps
from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, y=None, **fit_params):
        return X.todense()

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])

transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

# fit_transform returns an array, not a DataFrame, so wrap it (reusing the
# labels' index) before concatenating with y
df_train = pd.concat([pd.DataFrame(X_train, index=y_train.index), y_train], axis=1)
df_test = pd.concat([pd.DataFrame(X_test, index=y_test.index), y_test], axis=1)
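Tying this back to your original random-forest pipeline: instead of transforming X up front, you can also put the ColumnTransformer inside the Pipeline as its first step, so fit and predict work directly on the DataFrame. A minimal sketch with a toy stand-in for your df (the data values here are illustrative, not from your dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for df; your real frame has the same columns
df = pd.DataFrame({
    'Text': ['an example of text', 'ANOTHER example of text',
             "What's happening?Let's talk at 5", 'one more line of text'],
    'is_it_capital?': [0, 1, 1, 0],
    'is_it_upper?': [0, 1, 0, 0],
    'contains_num?': [0, 0, 1, 0],
})
y = pd.Series([0, 1, 1, 0], name='Label')

# BoW on 'Text', binary columns passed through, then the classifier
random_forest_bow = Pipeline([
    ('features', ColumnTransformer([('text', CountVectorizer(), 'Text')],
                                   remainder='passthrough')),
    ('rf_clf', RandomForestClassifier(n_estimators=300, n_jobs=3)),
])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=40)
random_forest_bow.fit(X_train, y_train)
predicted = random_forest_bow.predict(X_test)
```

Keeping the transformer inside the pipeline also means the BoW vocabulary is fit only on the training split, which avoids leaking test data into the vectorizer.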
Answered By - blacksite