Issue
I would like to include multiple features in a classifier to improve model performance. I have a dataset similar to this one:
| Text | is_it_capital? | is_it_upper? | contains_num? | Label |
|---|---|---|---|---|
| an example of text | 0 | 0 | 0 | 0 |
| ANOTHER example of text | 1 | 1 | 0 | 1 |
| What's happening?Let's talk at 5 | 1 | 0 | 1 | 1 |
I am applying different pre-processing algorithms to Text (BoW, TF-IDF, ...). It was 'easy' to use only the Text column in my classifier by selecting X = df['Text'] and applying the pre-processing algorithm. However, I would now also like to include is_it_capital? and the other variables (except Label) as features, since I found them potentially useful for my classifier.
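For reference, the text-only setup described above might look like the following sketch (my assumed setup, using TF-IDF as one of the mentioned options):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Text-only baseline: vectorize the single Text column, then classify.
text_only = Pipeline([
    ('tfidf', TfidfVectorizer()),   # or CountVectorizer() for plain BoW
    ('clf', LogisticRegression())
])
text_only.fit(df['Text'], df['Label'])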
What I tried was the following:
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

class DenseTransformer(TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.todense()

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
])

transformer = ColumnTransformer([('text', pipeline, 'Text')], remainder='passthrough')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

df_train = pd.concat([X_train, y_train], axis=1)  # this line raises the TypeError below
df_test = pd.concat([X_test, y_test], axis=1)

#Logistic regression
logR_pipeline = Pipeline([
    ('LogRCV', countV),  # countV is a CountVectorizer defined elsewhere in my code
    ('LogR_clf', LogisticRegression())
])

logR_pipeline.fit(df_train['Text'], df_train['Label'])
predicted_LogR = logR_pipeline.predict(df_test['Text'])
np.mean(predicted_LogR == df_test['Label'])
However, I got the error:
TypeError: cannot concatenate object of type '<class 'scipy.sparse.csr.csr_matrix'>'; only Series and DataFrame objs are valid
Has anyone dealt with a similar problem? How can I fix it? My goal is to include all the features in my classifiers.
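For context, the concat fails because fit_transform here returns a scipy sparse matrix (the csr_matrix in the traceback), not a pandas object. A minimal sketch of one workaround (an assumption on my part, requiring pandas >= 0.25 for the sparse accessor) is to wrap the matrix in a DataFrame first:

import pandas as pd

# X_train is the csr_matrix returned by transformer.fit_transform above;
# a sparse-backed DataFrame keeps memory low and makes pd.concat valid.
df_train = pd.concat(
    [pd.DataFrame.sparse.from_spmatrix(X_train, index=y_train.index), y_train],
    axis=1
)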
UPDATE:
I also tried this:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

class custom_count_v(BaseEstimator, TransformerMixin):
    def __init__(self, tfidf):
        self.tfidf = tfidf  # the attribute is named tfidf, but any vectorizer works

    def fit(self, X, y=None):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        self.tfidf.fit(joined_X)
        return self

    def transform(self, X):
        joined_X = X.apply(lambda x: ' '.join(x), axis=1)
        return self.tfidf.transform(joined_X)

count_v = CountVectorizer()
clmn = ColumnTransformer([("count", custom_count_v(count_v), ['Text'])], remainder="passthrough")
clmn.fit_transform(df)
It does not return any error, but it is not clear whether I am including all the features correctly, or whether the fitting should happen before or after the train/test split (see the sketch after the skeleton below). It would be extremely helpful if you could show me the steps up to the application of the classifier:
#Logistic regression
logR_pipeline = Pipeline([
    ('LogRCV', ....),
    ('LogR_clf', LogisticRegression())
])

logR_pipeline.fit(....)
predicted_LogR = logR_pipeline.predict(...)
np.mean(predicted_LogR == ...)
where the dots should be replaced by a dataframe or a column (depending on the transformation and concatenation, I guess), so that I can better understand the steps and the mistakes I made.
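On the split question: vectorizers are generally fit on the training portion only and then reused on the test portion, to avoid leaking test vocabulary into training. A minimal sketch (reusing the clmn transformer and the X, y from the update above):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

X_train_t = clmn.fit_transform(X_train)  # learn the vocabulary on the training rows only
X_test_t = clmn.transform(X_test)        # reuse it on the test rows, no refitting

# The column count should equal the vocabulary size plus the three passthrough flags.
print(X_train_t.shape, X_test_t.shape)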
Solution
Your error comes from trying to concatenate a sparse matrix with a Series: pd.concat only accepts Series and DataFrame objects.
I'm not familiar with Pipeline and ColumnTransformer, so I may be mistaken; it seems, though, that the pipeline doesn't capture the feature names from CountVectorizer, so an unlabelled dataframe won't do you any good: maybe you could stick to numpy arrays. If I'm mistaken, it is easy enough to jump from np.array to a dataframe anyway...
So, you could do:
import numpy as np

df_train = np.append(
    X_train,  # this is already a numpy array
    np.array(y_train).reshape(len(y_train), 1),  # convert the Series to a numpy array of the right shape
    axis=1
)
print(df_train)
[[1 0 1 0 0 1 0 1 0 1 1 0 1]
[0 1 0 1 1 0 1 0 1 1 0 1 1]]
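As an aside (my addition, not part of the original answer): densifying the vectorizer output can blow up memory on large corpora; scipy.sparse.hstack keeps everything sparse, and LogisticRegression accepts sparse input directly. A self-contained sketch on toy data:

import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

toy = pd.DataFrame({'Text': ['an example of text', 'ANOTHER example of text'],
                    'is_it_capital?': [0, 1]})

cv = CountVectorizer()
X_text = cv.fit_transform(toy['Text'])                       # sparse word counts
X_extra = sp.csr_matrix(toy[['is_it_capital?']].to_numpy())  # numeric flag column(s)
X_combined = sp.hstack([X_text, X_extra], format='csr')      # everything stays sparse

print(X_combined.shape)  # (2, vocabulary size + 1)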
Hope this helps (though as I said, I'm not familiar with these sklearn libraries...)
EDIT
Something more complete, and without those pipelines (which I'm not sure are needed anyway); it fails on my computer because of the tiny input dataset, but you may have more success with your complete dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame(
    [["an example of text", 0, 0, 0, 0],
     ["ANOTHER example of text", 1, 1, 0, 1],
     ["What's happening?Let's talk at 5", 1, 0, 1, 1]],
    columns=["Text", "is_it_capital?", "is_it_upper?", "contains_num?", "Label"]
)

X = df[['Text', 'is_it_capital?', 'is_it_upper?', 'contains_num?']]
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

cv = CountVectorizer()
X_train = (
    pd.DataFrame(
        cv.fit_transform(X_train['Text']).toarray(),
        columns=cv.get_feature_names_out(),  # cv.get_feature_names() in sklearn < 1.0
        index=X_train.index
    )  # this way you keep the labels/indexes in a dataframe format
    .join(X_train.drop('Text', axis=1))  # add your previous 'get_dummies' columns
)
X_test = (
    pd.DataFrame(
        cv.transform(X_test['Text']).toarray(),
        columns=cv.get_feature_names_out(),
        index=X_test.index
    )
    .join(X_test.drop('Text', axis=1))
)

# Then compute your regression directly:
lr = LogisticRegression()
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
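To fill in the question's skeleton (my sketch, not part of the original answer): the same thing can be done in a single Pipeline, with a ColumnTransformer applying CountVectorizer to the Text column and passing the 0/1 flags through, so fit/predict take the raw columns directly:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Re-split the raw columns (before the manual vectorization above).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)

preprocess = ColumnTransformer(
    [('text', CountVectorizer(), 'Text')],  # a string (not list) selector hands a 1-D column to the vectorizer
    remainder='passthrough'                 # keep is_it_capital?, is_it_upper?, contains_num? as-is
)

logR_pipeline = Pipeline([
    ('LogRCV', preprocess),
    ('LogR_clf', LogisticRegression())
])

# Like the snippet above, this needs more than the 3-row toy dataset to fit meaningfully.
logR_pipeline.fit(X_train, y_train)             # learns the vocabulary on the training rows only
predicted_LogR = logR_pipeline.predict(X_test)
print(np.mean(predicted_LogR == y_test))        # accuracy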
Answered By - tgrandje