Issue
I can best describe my goal using a synthetic dataset. Suppose I have the following:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=3)
df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42)
X_train.head()
A B C D E F G H I J
541 -0.277848 1.022357 -0.950125 -2.100213 0.883638 0.821387 1.154613 0.075376 1.176242 -0.470087
440 1.089665 0.841446 -1.701004 -1.036256 -1.229357 0.345068 1.876470 -0.750067 0.080685 -1.318271
482 0.016010 0.025488 -1.189296 -1.052935 -0.623029 0.669521 1.518927 0.690019 -0.045486 -0.494186
422 -0.133358 -2.16219 1.170989 -0.942150 1.933444 -0.55118 -0.059908 -0.938672 -0.924097 -0.796185
778 0.901954 1.479360 -2.639176 -2.588845 -0.753915 -1.650621 2.727146 0.075260 1.330432 -0.941594
After conducting a feature importance analysis, I discovered that each of the 3 classes in the dataset is best predicted using a subset of the features, as opposed to the whole set. For example:
class | optimal predictors
-------+-------------------
0 | A, B, C
1 | D, E, F, G
2 | G, H, I, J
-------+-------------------
At this point, I would like to use 3 one-vs-rest classifiers as the base models, one per class, each trained on that class's best predictors, and then a StackingClassifier for the final prediction.
I have a high-level understanding of the StackingClassifier, where different base models can be trained (e.g. DT, SVC, KNN, etc.) and a meta classifier uses another model, e.g. Logistic Regression.
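For reference, a generic StackingClassifier of that kind might look like this (the estimator choices below are only an illustration, not my actual setup):
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Generic stacking: several different base models plus a logistic regression meta model
generic_stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC()),
        ('knn', KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression()
)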
In this case, however, every base model is a DT classifier; the difference is that each one should be trained only on the feature subset that is best for its class, as above. Then the stack makes the final predictions on X_test.
But I am not sure how this can be done, so I have described my goal using the synthetic data above. How can I design this to train the base models and produce a final prediction?
Solution
You can do what you describe programmatically, but I am not sure what the gain would be over a simple Random Forest, which internally does all of this (feature sub-selection, fitting, etc.).
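For comparison, such a Random Forest baseline could be as simple as the following (the hyperparameters here are only illustrative):
from sklearn.ensemble import RandomForestClassifier

# Each split in each tree considers only a random subset of the features
# (controlled by max_features), so feature sub-selection happens internally.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)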
Here is an implementation of what you have described. I have used exactly the same base and stacking models as the ones you mentioned:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
def select_columns(X, columns):
    # Helper used by FunctionTransformer: keep only the given columns
    return X[columns]
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3, n_informative=3)
df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)
feature_subsets = {
    0: ['A', 'B', 'C'],
    1: ['D', 'E', 'F', 'G'],
    2: ['G', 'H', 'I', 'J']
}
# Base model
base_dt_model = DecisionTreeClassifier(random_state=42)
# One-vs-Rest classifiers with per-class feature subsets
classifiers = []
for class_label, features in feature_subsets.items():
    model = clone(base_dt_model)
    # select the class-specific features, then fit a fresh decision tree on them
    pipeline = Pipeline([
        ('feature_selection', FunctionTransformer(select_columns, kw_args={'columns': features})),
        ('classifier', model)
    ])
    classifiers.append(('dt_class_' + str(class_label), pipeline))
# Logistic Regression as the metaclassifier
stack = StackingClassifier(estimators=classifiers, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
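As a quick sanity check (this evaluation snippet is just an illustration, not part of the pipeline itself), you can score the stacked model on the held-out set:
from sklearn.metrics import accuracy_score, classification_report

# Compare the stacked predictions against the true labels of the held-out split
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))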
Answered By - seralouk