Issue
I am trying to train a StackingClassifier in scikit-learn, but I keep running into an error where the fit method seems to be complaining that I passed it numpy arrays. To my knowledge, numpy arrays are exactly what every fit method in sklearn is supposed to accept. I read and followed the example from the documentation, then expanded on it into a more comprehensive pipeline that processes categorical, ordinal, scalar, and text data.
Sorry in advance for the lengthy code sample, but I felt it was necessary to provide a complete reproducible example. Breaking the pipeline down into its constituent estimators and testing each of those individually did not raise any exceptions, so I figure the error somehow arises from the combined estimator.
Select Features
import pandas as pd
from sklearn import compose, impute, preprocessing
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

categorical_data = [
"race",
"gender",
"admission_type_id",
"discharge_disposition_id",
"admission_source_id",
"insulin",
"diabetesMed",
"change",
"payer_code",
"A1Cresult",
"metformin",
"repaglinide",
"nateglinide",
"chlorpropamide",
"glimepiride",
"glipizide",
"glyburide",
"tolbutamide",
"pioglitazone",
"rosiglitazone",
"acarbose",
"miglitol",
"tolazamide",
"glyburide.metformin",
"glipizide.metformin",
]
ordinal_data = [
"age"
]
scalar_data = [
"num_medications",
"time_in_hospital",
"num_lab_procedures",
"num_procedures",
"number_outpatient",
"number_emergency",
"number_inpatient",
"number_diagnoses",
]
text_data = [
"diag_1_desc",
"diag_2_desc",
"diag_3_desc"
]
Create Column Transformers
impute_trans = compose.make_column_transformer(
(
impute.SimpleImputer(
strategy="constant",
fill_value="missing"
),
categorical_data
)
)
encode_trans = compose.make_column_transformer(
(
preprocessing.OneHotEncoder(
sparse=False,
handle_unknown="ignore"
),
categorical_data
),
(
preprocessing.OrdinalEncoder(),
ordinal_data
)
)
scalar_trans = compose.make_column_transformer(
(preprocessing.StandardScaler(), scalar_data),
)
text_trans = compose.make_column_transformer(
(TfidfVectorizer(ngram_range=(1,2)), "diag_1_desc"),
(TfidfVectorizer(ngram_range=(1,2)), "diag_2_desc"),
(TfidfVectorizer(ngram_range=(1,2)), "diag_3_desc"),
)
Create Estimators
cat_pre_pipe = make_pipeline(impute_trans, encode_trans)
logreg = LogisticRegression(
solver = "saga",
penalty="elasticnet",
l1_ratio=0.5,
max_iter=1000
)
text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_pre_pipe, logreg)
estimators = [
("cat", cat_pipe),
("text", text_pipe),
("scalar", scalar_pipe)
]
Create Stacking Classifier
stack_clf = StackingClassifier(
estimators=estimators,
final_estimator=logreg
)
diabetes_data = pd.read_csv("8k_diabetes.csv", delimiter=',')
x_train, x_test, y_train, y_test = train_test_split(
pd.concat([
preprocess_dataframe(diabetes_data[text_data]),
diabetes_data[categorical_data + scalar_data]
], axis=1),
diabetes_data["readmitted"].astype(int)
)
stack_clf.fit(x_train, y_train)
Complete Stack Trace
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/__init__.py:409, in _get_column_indices(X, key)
408 try:
--> 409 all_columns = X.columns
410 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 stack_clf.fit(x_train, y_train)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py:488, in StackingClassifier.fit(self, X, y, sample_weight)
486 self._le = LabelEncoder().fit(y)
487 self.classes_ = self._le.classes_
--> 488 return super().fit(X, self._le.transform(y), sample_weight)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py:158, in _BaseStacking.fit(self, X, y, sample_weight)
153 stack_method = [self.stack_method] * len(all_estimators)
155 # Fit the base estimators on the whole training data. Those
156 # base estimators will be used in transform, predict, and
157 # predict_proba. They are exposed publicly.
--> 158 self.estimators_ = Parallel(n_jobs=self.n_jobs)(
159 delayed(_fit_single_estimator)(clone(est), X, y, sample_weight)
160 for est in all_estimators
161 if est != "drop"
162 )
164 self.named_estimators_ = Bunch()
165 est_fitted_idx = 0
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:1043, in Parallel.__call__(self, iterable)
1034 try:
1035 # Only set self._iterating to True if at least a batch
1036 # was dispatched. In particular this covers the edge
(...)
1040 # was very quick and its callback already dispatched all the
1041 # remaining jobs.
1042 self._iterating = False
-> 1043 if self.dispatch_one_batch(iterator):
1044 self._iterating = self._original_iterator is not None
1046 while self.dispatch_one_batch(iterator):
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:861, in Parallel.dispatch_one_batch(self, iterator)
859 return False
860 else:
--> 861 self._dispatch(tasks)
862 return True
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:779, in Parallel._dispatch(self, batch)
777 with self._lock:
778 job_idx = len(self._jobs)
--> 779 job = self._backend.apply_async(batch, callback=cb)
780 # A job can complete so quickly than its callback is
781 # called before we get here, causing self._jobs to
782 # grow. To ensure correct results ordering, .insert is
783 # used (rather than .append) in the following line
784 self._jobs.insert(job_idx, job)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/_parallel_backends.py:572, in ImmediateResult.__init__(self, batch)
569 def __init__(self, batch):
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:262, in BatchedCalls.__call__(self)
258 def __call__(self):
259 # Set the default nested backend to self._backend but do not set the
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:262, in <listcomp>(.0)
258 def __call__(self):
259 # Set the default nested backend to self._backend but do not set the
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/fixes.py:216, in _FuncWrapper.__call__(self, *args, **kwargs)
214 def __call__(self, *args, **kwargs):
215 with config_context(**self.config):
--> 216 return self.function(*args, **kwargs)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_base.py:42, in _fit_single_estimator(estimator, X, y, sample_weight, message_clsname, message)
40 else:
41 with _print_elapsed_time(message_clsname, message):
---> 42 estimator.fit(X, y)
43 return estimator
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:390, in Pipeline.fit(self, X, y, **fit_params)
364 """Fit the model.
365
366 Fit all the transformers one after the other and transform the
(...)
387 Pipeline with fitted steps.
388 """
389 fit_params_steps = self._check_fit_params(**fit_params)
--> 390 Xt = self._fit(X, y, **fit_params_steps)
391 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
392 if self._final_estimator != "passthrough":
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:348, in Pipeline._fit(self, X, y, **fit_params_steps)
346 cloned_transformer = clone(transformer)
347 # Fit or load from cache the current transformer
--> 348 X, fitted_transformer = fit_transform_one_cached(
349 cloned_transformer,
350 X,
351 y,
352 None,
353 message_clsname="Pipeline",
354 message=self._log_message(step_idx),
355 **fit_params_steps[name],
356 )
357 # Replace the transformer of the step with the fitted
358 # transformer. This is necessary when loading the transformer
359 # from the cache.
360 self.steps[step_idx] = (name, fitted_transformer)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/memory.py:349, in NotMemorizedFunc.__call__(self, *args, **kwargs)
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:893, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, "fit_transform"):
--> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:434, in Pipeline.fit_transform(self, X, y, **fit_params)
432 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
433 if hasattr(last_step, "fit_transform"):
--> 434 return last_step.fit_transform(Xt, y, **fit_params_last_step)
435 else:
436 return last_step.fit(Xt, y, **fit_params_last_step).transform(Xt)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py:672, in ColumnTransformer.fit_transform(self, X, y)
670 self._check_n_features(X, reset=True)
671 self._validate_transformers()
--> 672 self._validate_column_callables(X)
673 self._validate_remainder(X)
675 result = self._fit_transform(X, y, _fit_transform_one)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py:352, in ColumnTransformer._validate_column_callables(self, X)
350 columns = columns(X)
351 all_columns.append(columns)
--> 352 transformer_to_input_indices[name] = _get_column_indices(X, columns)
354 self._columns = all_columns
355 self._transformer_to_input_indices = transformer_to_input_indices
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/__init__.py:411, in _get_column_indices(X, key)
409 all_columns = X.columns
410 except AttributeError:
--> 411 raise ValueError(
412 "Specifying the columns using strings is only "
413 "supported for pandas DataFrames"
414 )
415 if isinstance(key, str):
416 columns = [key]
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Solution
Your categorical pipeline chains two column transformers together. After the first one runs, its output is a numpy array, so the second one can no longer select columns by name as you've asked it to. Notice that the final error message is the more informative one here: ValueError: Specifying the columns using strings is only supported for pandas DataFrames.
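You can see this directly: the first column transformer's output carries no column labels for the second one to match. A minimal sketch, using a hypothetical two-row frame in place of the diabetes data:
import numpy as np
import pandas as pd
from sklearn import compose, impute

# Hypothetical two-row frame standing in for the diabetes data.
df = pd.DataFrame({"race": ["Caucasian", np.nan], "age": ["[0-10)", "[10-20)"]})

impute_only = compose.make_column_transformer(
    (impute.SimpleImputer(strategy="constant", fill_value="missing"), ["race"])
)

out = impute_only.fit_transform(df)
print(type(out))  # <class 'numpy.ndarray'> -- the column labels are gone,
                  # so a following ColumnTransformer can't select "race" by name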
For this reason, I'd suggest using one column transformer containing separate pipelines, rather than one pipeline containing multiple column transformers.
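As a sketch of that restructuring for the categorical branch (this reuses categorical_data, ordinal_data, and logreg from your code; the imputer and encoder now live in a single pipeline that receives its column selection exactly once, on the original DataFrame):
from sklearn import compose, impute, preprocessing
from sklearn.pipeline import make_pipeline

# Impute, then one-hot encode, inside one branch of a single
# ColumnTransformer; only that outer transformer selects columns by name.
cat_branch = make_pipeline(
    impute.SimpleImputer(strategy="constant", fill_value="missing"),
    preprocessing.OneHotEncoder(sparse=False, handle_unknown="ignore"),
)

cat_pre_pipe = compose.make_column_transformer(
    (cat_branch, categorical_data),
    (preprocessing.OrdinalEncoder(), ordinal_data),
)

cat_pipe = make_pipeline(cat_pre_pipe, logreg)
With that change each base estimator still receives the full DataFrame and performs its own column selection, so stack_clf.fit(x_train, y_train) can be called unchanged.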
Answered By - Ben Reiniger