Issue
I have a features DF that looks like
text | number |
---|---|
text1 | 0 |
text2 | 1 |
... | ... |
where the number column is binary and the text column contains texts with ~2k characters in each row. The targets DF contains three classes.

def get_numeric_data(x):
    return [x.number.values]

def get_text_data(x):
    return [record for record in x.text.values]

transfomer_numeric = FunctionTransformer(get_numeric_data)
transformer_text = FunctionTransformer(get_text_data)

When I try to fit the pipeline below, I get this error:

File "C:\fakepath\scipy\sparse\construct.py", line 588, in bmat
    raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 98, expected 1.

I tried writing the functions get_text_data and get_numeric_data in different ways, but none of them helped.

combined_clf = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', transfomer_numeric)
        ])),
        ('text_features', Pipeline([
            ('selector', transformer_text),
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl),
        ]))
    ])),
    ('clf', SGDClassifier(random_state=42,
                          max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])

gs_clf = GridSearchCV(combined_clf, parameters, cv=5, n_jobs=-1)
gs_clf.fit(X_train, y_train)

Solution
The main problem is the way you return the numeric values. x.number.values returns a 1-D array of shape (n_samples,), and wrapping it in a list, as your function does, turns it into a single row of shape (1, n_samples). FeatureUnion stacks the outputs of its transformers side by side, so they must all have the same number of rows. In your case the transformed text features have 98 rows, one per sample, which is why the error reports 98 rows where it expected 1: the two blocks cannot be combined.

An easy fix is to reshape the vector into a 2-D array with dimensions (n_samples, 1), like this:

def get_numeric_data(x):
    return x.number.values.reshape(-1, 1)

Note that I removed the brackets surrounding the expression, as they unnecessarily wrapped the result in a list.
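
To make the shape mismatch concrete, here is a minimal sketch using only NumPy. The arrays are made-up stand-ins (four samples, three text features), and np.hstack is used as an approximation of the horizontal stacking that FeatureUnion performs on dense outputs; it is not your actual data or pipeline.

```python
import numpy as np

# Hypothetical stand-ins: 4 samples, 3 already-vectorised text features.
numbers = np.array([0, 1, 1, 0])      # plays the role of x.number.values
text_features = np.ones((4, 3))       # plays the role of the text block

as_list = np.asarray([numbers])       # what [x.number.values] amounts to
as_column = numbers.reshape(-1, 1)    # the reshaped version

print(numbers.shape, as_list.shape, as_column.shape)

# Stacking side by side requires equal row counts, so the list-wrapped
# version (1 row) cannot be combined with the text block (4 rows).
try:
    np.hstack([as_list, text_features])
    list_wrapped_fails = False
except ValueError:
    list_wrapped_fails = True

# The column version has one row per sample and stacks cleanly.
combined = np.hstack([as_column, text_features])
print(list_wrapped_fails, combined.shape)
```

The single-row block is the same situation the error message describes, only with 4 rows in place of 98.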
While the above will make your code run, there are still a couple of things about your code that are not quite efficient and can be improved.
First is the expression [record for record in x.text.values], which is redundant, as x.text.values would already be enough. The only difference is that the former is a list object, whereas the latter is a numpy ndarray, which is usually preferred.
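
With both selectors simplified, the FeatureUnion approach can be checked on a small example. This is only a sketch under assumptions: the three-row DataFrame is invented, and a plain CountVectorizer stands in for your vect, tfidf and scl steps, which are not shown in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def get_numeric_data(x):
    # One row per sample, one column: shape (n_samples, 1)
    return x.number.values.reshape(-1, 1)


def get_text_data(x):
    # The array of strings is enough; no list comprehension needed
    return x.text.values


# Toy data, assumed purely for illustration
X = pd.DataFrame({
    "text": ["red apple", "blue sky", "red car"],
    "number": [0, 1, 0],
})

union = FeatureUnion([
    ("numeric_features", FunctionTransformer(get_numeric_data)),
    ("text_features", Pipeline([
        ("selector", FunctionTransformer(get_text_data)),
        ("vect", CountVectorizer()),  # stand-in for the original text steps
    ])),
])

features = union.fit_transform(X)
print(features.shape)  # rows = samples, columns = 1 numeric + vocabulary size
```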
Second is what Ben Reiniger already stated in his comment. FeatureUnion is meant to perform several transformations on the same data and combine the results into a single object. However, it appears that you simply want to transform the text features separately from your numeric ones. In this case, the ColumnTransformer offers a much simpler and more canonical way:

combined_clf = Pipeline([
    ('transformer', ColumnTransformer([
        ('vectorizer', Pipeline([
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl)
        ]), 'text')
    ], remainder='passthrough')),
    ('clf', SGDClassifier(random_state=42, max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])

What happens above is that ColumnTransformer selects only the text column and passes it to the pipeline of transformations, then merges the result with the numeric column, which is passed through unchanged. Note that defining your own selectors becomes unnecessary, because you tell ColumnTransformer directly which columns each transformer should handle. See the documentation for more information.
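
For a version you can run end to end, here is a small self-contained sketch of the same idea. The six-row DataFrame and the labels are invented, and a single TfidfVectorizer replaces the vect, tfidf and scl objects from the question, since their definitions are not shown; treat it as an illustration rather than a drop-in replacement.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy data, assumed for illustration: short texts, a binary column, 3 classes
X = pd.DataFrame({
    "text": ["red apple pie", "green apple tart", "blue sky today",
             "grey sky tonight", "fast red car", "slow green car"],
    "number": [0, 1, 0, 1, 0, 1],
})
y = ["food", "food", "weather", "weather", "vehicle", "vehicle"]

clf = Pipeline([
    ("transformer", ColumnTransformer(
        # The column is given as a plain string so the vectorizer
        # receives a 1-D sequence of documents.
        [("vectorizer", TfidfVectorizer(), "text")],
        remainder="passthrough",  # the number column is appended unchanged
    )),
    ("clf", SGDClassifier(random_state=42)),
])
clf.fit(X, y)

transformer = clf.named_steps["transformer"]
vocab_size = len(transformer.named_transformers_["vectorizer"].vocabulary_)
transformed = transformer.transform(X)
predictions = clf.predict(X)

print(transformed.shape, vocab_size)
print(predictions)
```

Passing 'text' as a string rather than ['text'] matters here: a list would hand the vectorizer a 2-D DataFrame instead of the 1-D column of documents it expects.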
Answered By - afsharov