Monday, November 8, 2021

[FIXED] How to properly build a SGDClassifier with both text and numerical data using FeatureUnion and Pipeline?


Issue

I have a features DataFrame that looks like this:

text	number
text1	0
text2	1
...	...

where the number column is binary and the text column contains texts with ~2k characters in each row. The targets DF contains three classes.

from sklearn.preprocessing import FunctionTransformer

def get_numeric_data(x):
    return [x.number.values]

def get_text_data(x):
    return [record for record in x.text.values]

transfomer_numeric = FunctionTransformer(get_numeric_data)
transformer_text = FunctionTransformer(get_text_data)

When I try to fit the pipeline below, I get this error:

File "C:\fakepath\scipy\sparse\construct.py", line 588, in bmat
    raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 98, expected 1.

I tried writing get_text_data and get_numeric_data in different ways, but none of them helped.

combined_clf = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', transfomer_numeric)
        ])),
        ('text_features', Pipeline([
            ('selector', transformer_text),
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl),
        ]))
    ])),
    ('clf', SGDClassifier(random_state=42,
                          max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])
gs_clf = GridSearchCV(combined_clf, parameters, cv=5, n_jobs=-1)
gs_clf.fit(X_train, y_train)

Solution

The main problem is the way you are returning the numeric values. x.number.values is a 1-D array of shape (n_samples,), and wrapping it in a list turns it into a single row of shape (1, n_samples). FeatureUnion later tries to stack this side by side with the transformed text features, which have shape (n_samples, n_features). The row counts do not match, which is what the error reports: the text block has 98 rows, while the numeric block has only 1.

An easy fix would be to reshape the vector into a 2d array with dimensions (n_samples, 1) like the following:

def get_numeric_data(x):
    return x.number.values.reshape(-1, 1)

Note that I removed the brackets surrounding the expression, as they unnecessarily wrapped the result in a list.
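To make the shape mismatch concrete, here is a minimal sketch that reproduces it outside of scikit-learn. The toy arrays are assumptions for illustration only: numbers stands in for x.number.values and text_features for the output of the text pipeline. It stacks the blocks with scipy.sparse.hstack, which is an approximation of what FeatureUnion does when at least one block is sparse.

```python
import numpy as np
from scipy import sparse

n_samples = 5

# Stand-in for x.number.values: a 1-D array of shape (n_samples,)
numbers = np.array([0, 1, 0, 1, 1])

# Stand-in for the transformed text features: shape (n_samples, 3)
text_features = sparse.csr_matrix(np.arange(n_samples * 3).reshape(n_samples, 3))

# What the original selector returns: the list wrapper makes it one row
wrapped = np.asarray([numbers])      # shape (1, n_samples)

# What the fixed selector returns: one column, one row per sample
column = numbers.reshape(-1, 1)      # shape (n_samples, 1)

try:
    sparse.hstack([sparse.csr_matrix(wrapped), text_features])
    stacking_failed = False
except ValueError as err:
    # Row counts differ (1 vs n_samples), so the blocks cannot be joined
    stacking_failed = True
    print("hstack failed:", err)

# With matching row counts the blocks are joined column-wise
combined = sparse.hstack([sparse.csr_matrix(column), text_features])
print(combined.shape)
```

The reshaped column has the same number of rows as the text block, so the result has one extra feature column next to the text features.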

While the above will make your code run, a couple of other things in it are more complicated than they need to be and can be improved.

First, the expression [record for record in x.text.values] is redundant: x.text.values alone is enough. The only difference is that the former is a list, whereas the latter is a NumPy ndarray, which is usually preferred.
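As a quick check of this point, the sketch below feeds both forms to a vectorizer and compares the results. The toy DataFrame and the use of CountVectorizer are assumptions for illustration; the question does not show how vect is defined.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data standing in for the features DataFrame from the question
df = pd.DataFrame({
    "text": ["red apple", "green apple", "red car"],
    "number": [0, 1, 0],
})

as_list = [record for record in df.text.values]   # original selector
as_array = df.text.values                         # simplified selector

counts_from_list = CountVectorizer().fit_transform(as_list)
counts_from_array = CountVectorizer().fit_transform(as_array)

# Both inputs are just iterables of strings to the vectorizer
same = np.array_equal(counts_from_list.toarray(), counts_from_array.toarray())
print(same)
```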

Second, as Ben Reiniger already stated in his comment, FeatureUnion is meant to apply several transformations to the same data and combine the results into a single object. However, it appears that you simply want to transform the text features separately from the numeric ones. In this case, ColumnTransformer offers a much simpler and more canonical way:

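from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# vect, tfidf, scl and X_train are the objects already defined in the question.
combined_clf = Pipeline([
    ('transformer', ColumnTransformer([
        # Apply the text pipeline to the 'text' column only.
        ('vectorizer', Pipeline([
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl)
        ]), 'text')
    # Pass the remaining (numeric) column through unchanged.
    ], remainder='passthrough')),
    ('clf', SGDClassifier(random_state=42, max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])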

What happens above is that ColumnTransformer selects only the text column and passes it to the pipeline of transformations, then merges the result with the numeric column, which is passed through unchanged. Note that you no longer need to define your own selectors: you tell ColumnTransformer which columns each transformer should handle, and it takes care of the selection. See the documentation for more information.
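For reference, here is a self-contained sketch of the same layout on toy data. Everything not shown in the question is an assumption: the sample texts and labels are made up, and CountVectorizer, TfidfTransformer and MaxAbsScaler are stand-ins for vect, tfidf and scl, whose definitions are not given.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy data: one text column, one binary column, three classes
X_train = pd.DataFrame({
    "text": ["cheap pills online", "meeting at noon", "win money now",
             "lunch tomorrow", "free prize inside", "project status update"],
    "number": [1, 0, 1, 0, 1, 0],
})
y_train = ["spam", "work", "spam", "personal", "spam", "work"]

combined_clf = Pipeline([
    ("transformer", ColumnTransformer([
        # The column is given as a string, so the vectorizer receives
        # a 1-D sequence of documents rather than a 2-D frame.
        ("vectorizer", Pipeline([
            ("vect", CountVectorizer()),      # assumed stand-in for vect
            ("tfidf", TfidfTransformer()),    # assumed stand-in for tfidf
            ("scaler", MaxAbsScaler()),       # assumed stand-in for scl
        ]), "text"),
    ], remainder="passthrough")),             # keeps the 'number' column
    ("clf", SGDClassifier(random_state=42)),
])

combined_clf.fit(X_train, y_train)

transformer = combined_clf.named_steps["transformer"]
features = transformer.transform(X_train)
vect_step = transformer.named_transformers_["vectorizer"].named_steps["vect"]
n_vocab = len(vect_step.vocabulary_)

# One column per vocabulary term plus the passed-through numeric column
print(features.shape, n_vocab)
predictions = combined_clf.predict(X_train)
```

Because the steps are nested, grid-search parameter names would follow the step names joined by double underscores, for example transformer__vectorizer__vect__ngram_range for the names used in this sketch.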

Answered By - afsharov

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
