Issue
I am using scikit-learn's StackingClassifier to combine several model pipelines for predicting hospital readmission on the UCI diabetes dataset. Each pipeline works fine on its own, but I keep running into problems when I try to combine them. I want to know why a standalone text classifier runs while the stacked classifier does not, and how I can fix it.
Here is the section that raises the error:
stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)
x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
        ], axis=1
    ),
    diabetes_data["readmitted"]
)
# This line throws the error in the fit function
stack_clf.fit(x_train, y_train).score(x_test, y_test)
ValueError: could not convert string to float: 'bronchitis specified acute chronic'
Now an example of a component classifier that works just fine:
x_train, x_test, y_train, y_test = train_test_split(
    diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
    diabetes_data["readmitted"]
)
text_pipe.fit(x_train, y_train).score(x_test, y_test)
0.5935
Because it is unclear to me where in the pipeline the error originates, I have provided the full minimal reproducible example below.
Select Columns
text_data = [
    "diag_1_desc",
    "diag_2_desc",
    "diag_3_desc"
]
scalar_data = [
    "num_medications",
    "time_in_hospital",
    "num_lab_procedures",
    "num_procedures",
    "number_outpatient",
    "number_emergency",
    "number_inpatient",
    "number_diagnoses",
]
ordinal_data = [
    "age"
]
categorical_data = [
    "race",
    "gender",
    "admission_type_id",
    "discharge_disposition_id",
    "admission_source_id",
    "insulin",
    "diabetesMed",
    "change",
    "A1Cresult",
    "metformin",
    "repaglinide",
    "nateglinide",
    "chlorpropamide",
    "glimepiride",
    "glipizide",
    "glyburide",
    "tolbutamide",
    "pioglitazone",
    "rosiglitazone",
    "acarbose",
    "miglitol",
    "tolazamide",
    "glyburide.metformin",
    "glipizide.metformin",
]
Create Logistic Regression Classifier
logreg = LogisticRegression(
    solver="saga",
    penalty="elasticnet",
    l1_ratio=0.5,
    max_iter=1000
)
Create Column Transformers
text_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1,2)), "diag_1_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_2_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_3_desc"),
    remainder="passthrough",
)
scalar_trans = compose.make_column_transformer(
    (
        preprocessing.StandardScaler(),
        scalar_data
    ),
    remainder="passthrough",
)
cat_trans = compose.make_column_transformer(
    (
        preprocessing.OneHotEncoder(
            sparse=False,
            handle_unknown="ignore"
        ),
        categorical_data
    ),
    (
        preprocessing.OrdinalEncoder(),
        ordinal_data
    ),
    remainder="passthrough",
)
Create Pipeline Estimators
text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_trans, logreg)
estimators = [
    ("cat", cat_pipe),
    ("text", text_pipe),
    ("scalar", scalar_pipe)
]
Create and Fit Stacking Classifier
stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)
x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
        ], axis=1
    ),
    diabetes_data["readmitted"]
)
stack_clf.fit(x_train, y_train).score(x_test, y_test)
ValueError: could not convert string to float: 'bronchitis specified acute chronic'
My pipeline also relies on two helper functions that I use for preprocessing the text data by removing punctuation and stopwords.
Helper Functions
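import re
import nltk
import pandas as pd

def preprocess_text(text):
    # Strip non-letters, lowercase, drop stopwords, and lemmatize what is left.
    try:
        text = re.sub('[^a-zA-Z]', ' ', text)
        words = text.lower().split()
        # Build the stopword set once per call rather than once per word.
        stop_words = set(nltk.corpus.stopwords.words('english'))
        lemmatizer = nltk.stem.WordNetLemmatizer()
        words = [word for word in words if word not in stop_words]
        words = [lemmatizer.lemmatize(word) for word in words if len(word) > 1]
        return ' '.join(words)
    except TypeError:
        # Non-string values (for example NaN) become empty strings.
        return ''

def preprocess_series(series):
    texts = []
    for i in range(len(series)):
        # Positional access, so this also works when the index is not 0..n-1.
        texts.append(preprocess_text(series.iloc[i]))
    # Keep the original index so the result aligns with the other columns.
    return pd.Series(texts, index=series.index)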
Solution
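It looks like your component pipelines don't all work on the combined data, just the text one on its own. Your pipelines use a column transformer with remainder='passthrough', which means any column not explicitly transformed is passed along untouched. In the stacked setup every pipeline receives all of the columns, so the raw text columns reach the logistic regression, which fails when it tries to convert those strings to floats. Setting remainder='drop' (the default) in each column transformer, so that every pipeline only sees the columns it actually transforms, should avoid the error.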
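The effect of the remainder setting can be reproduced on a small frame. The sketch below is only an illustration: the toy DataFrame, its values, and the build_scalar_pipe helper are made up for the example and are not part of the question's data or code. It assumes a scikit-learn version that provides StackingClassifier.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for diabetes_data (values are invented).
toy = pd.DataFrame({
    "diag_1_desc": ["acute bronchitis", "chronic kidney disease",
                    "acute bronchitis", "heart failure"] * 5,
    "num_medications": [3, 12, 4, 15] * 5,
})
y = pd.Series([0, 1, 0, 1] * 5)


def build_scalar_pipe(remainder):
    """Scale the numeric column; `remainder` decides what happens to the rest."""
    trans = make_column_transformer(
        (StandardScaler(), ["num_medications"]),
        remainder=remainder,
    )
    return make_pipeline(trans, LogisticRegression())


# With passthrough, the text column reaches LogisticRegression as raw strings.
try:
    build_scalar_pipe("passthrough").fit(toy, y)
    passthrough_error = None
except ValueError as exc:
    passthrough_error = str(exc)

# With drop, each pipeline only sees the columns it transforms.
text_pipe = make_pipeline(
    make_column_transformer(
        (TfidfVectorizer(), "diag_1_desc"),
        remainder="drop",
    ),
    LogisticRegression(),
)
stack = StackingClassifier(
    estimators=[("text", text_pipe), ("scalar", build_scalar_pipe("drop"))],
    final_estimator=LogisticRegression(),
)
score = stack.fit(toy, y).score(toy, y)

print("passthrough failed with:", passthrough_error)
print("stacked accuracy on the toy data:", score)
```

Fitting each component pipeline separately on the full combined frame, as done here for the scalar pipeline, is also a quick way to find out which estimator in a stack is the one raising the error.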
Answered By - Ben Reiniger