Wednesday, April 13, 2022

[FIXED] Why Does StackingClassifier Raise Error When Component Classifier Does Not?

Labels: ensemble-learning, python, scikit-learn

Issue

I am using StackingClassifier to combine several model pipelines for predicting hospital readmission on the UCI diabetes dataset. Each pipeline works fine on its own, but I keep running into problems when I try to combine them. I want to know why the standalone text classifier runs while the stacked classifier does not, and how I can fix it.

Here is the section that raises the error:

stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)

x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
    ], axis=1
    ),
    diabetes_data["readmitted"]                                                
)

# This line throws the error in the fit function
stack_clf.fit(x_train, y_train).score(x_test, y_test)

ValueError: could not convert string to float: 'bronchitis specified acute chronic'

Now an example of a component classifier that works just fine:

x_train, x_test, y_train, y_test = train_test_split(
    diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
    diabetes_data["readmitted"]
)

text_pipe.fit(x_train, y_train).score(x_test, y_test)

0.5935

Because it is unclear to me where in the pipeline the error is originating, I have provided the full minimal reproducible example below.

Imports

import re

import nltk
import pandas as pd
from sklearn import compose, preprocessing
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

Select Columns

text_data = [
    "diag_1_desc",
    "diag_2_desc",
    "diag_3_desc"
]

scalar_data = [
    "num_medications",
    "time_in_hospital",
    "num_lab_procedures",
    "num_procedures",
    "number_outpatient",
    "number_emergency",
    "number_inpatient",
    "number_diagnoses",
]

ordinal_data = [
    "age"
]

categorical_data = [
    "race",
    "gender",
    "admission_type_id",
    "discharge_disposition_id",
    "admission_source_id",
    "insulin",
    "diabetesMed",
    "change",
    "A1Cresult",
    "metformin",
    "repaglinide",
    "nateglinide",
    "chlorpropamide",
    "glimepiride",
    "glipizide",
    "glyburide",
    "tolbutamide",
    "pioglitazone",
    "rosiglitazone",
    "acarbose",
    "miglitol",
    "tolazamide",
    "glyburide.metformin",
    "glipizide.metformin",    
]

Create Logistic Regression Classifier

logreg = LogisticRegression(
    solver = "saga",
    penalty="elasticnet",
    l1_ratio=0.5,
    max_iter=1000
)

Create Column Transformers

text_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1,2)), "diag_1_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_2_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_3_desc"),
    remainder="passthrough",
)

scalar_trans = compose.make_column_transformer(
    (
        preprocessing.StandardScaler(),
        scalar_data
    ),
    remainder="passthrough",
)

cat_trans = compose.make_column_transformer(
    (
        preprocessing.OneHotEncoder(
            sparse_output=False,  # this parameter was named "sparse" before scikit-learn 1.2
            handle_unknown="ignore"
        ),
        categorical_data
    ),
    (
        preprocessing.OrdinalEncoder(),
        ordinal_data
    ),
    remainder="passthrough",
)

Create Pipeline Estimators

text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_trans, logreg)

estimators = [
    ("cat", cat_pipe),
    ("text", text_pipe),
    ("scalar", scalar_pipe)
]

Create and Fit Stacking Classifier

stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)

x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
    ], axis=1
    ),
    diabetes_data["readmitted"]                                                
)

stack_clf.fit(x_train, y_train).score(x_test, y_test)

ValueError: could not convert string to float: 'bronchitis specified acute chronic'

My pipeline also relies on two helper functions that I use for preprocessing the text data by removing punctuation and stopwords.

Helper Functions

# Build these once instead of rebuilding them for every word.
STOPWORDS = set(nltk.corpus.stopwords.words('english'))
LEMMATIZER = nltk.stem.WordNetLemmatizer()

def preprocess_text(text):
    # Non-string values such as NaN raise TypeError in re.sub and become ''.
    try:
        text = re.sub('[^a-zA-Z]', ' ', text)
    except TypeError:
        return ''
    words = [word for word in text.lower().split() if word not in STOPWORDS]
    words = [LEMMATIZER.lemmatize(word) for word in words if len(word) > 1]
    return ' '.join(words)

def preprocess_series(series):
    # Element-wise apply keeps the original index instead of assuming labels 0..n-1.
    return series.apply(preprocess_text)

Solution

It looks like your component pipelines don't all work, just the text one. Your other pipelines use a column transformer with remainder='passthrough', which means they pass the text columns along untouched, and the logistic regression then fails because it cannot convert those strings to floats.
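
As a minimal sketch of one way to act on this (not part of the original answer), the example below sets remainder="drop", which is also the ColumnTransformer default, so that each pipeline only sees the columns it transforms. The toy frame, its values, and the helper names make_scalar_pipe and make_text_pipe are assumptions made up for illustration; only the column names diag_1_desc and num_medications come from the question. In the question's own setup the same change would presumably be needed in text_trans, scalar_trans, and cat_trans, since the stack hands every pipeline the full combined frame.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the combined diabetes features.
X = pd.DataFrame({
    "diag_1_desc": ["acute bronchitis", "chronic kidney disease",
                    "heart failure", "type two diabetes"] * 5,
    "num_medications": [float(n) for n in range(20)],
})
y = [0, 1] * 10


def make_scalar_pipe(remainder):
    """Scale the numeric column; `remainder` decides what happens to the rest."""
    trans = make_column_transformer(
        (StandardScaler(), ["num_medications"]),
        remainder=remainder,
    )
    return make_pipeline(trans, LogisticRegression())


def make_text_pipe():
    """Vectorize the text column and drop every other column."""
    trans = make_column_transformer(
        (TfidfVectorizer(), "diag_1_desc"),
        remainder="drop",
    )
    return make_pipeline(trans, LogisticRegression())


# With passthrough, the raw text column reaches LogisticRegression unchanged.
try:
    make_scalar_pipe("passthrough").fit(X, y)
    passthrough_error = None
except ValueError as exc:
    passthrough_error = str(exc)
print("passthrough:", passthrough_error)

# With drop, every pipeline only receives columns it can handle.
stack_clf = StackingClassifier(
    estimators=[
        ("text", make_text_pipe()),
        ("scalar", make_scalar_pipe("drop")),
    ],
    final_estimator=LogisticRegression(),
)
stack_clf.fit(X, y)
accuracy = stack_clf.score(X, y)
print("stacked accuracy on the toy data:", accuracy)
```

Fitting each component pipeline on the same combined frame that the stack receives, as the try block does here, is also a quick way to find out which estimator raises the error.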

Answered By - Ben Reiniger

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
