Issue
I can best describe my goal using a synthetic dataset. Suppose I have the following:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=3)
df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42)
X_train.head()
A B C D E F G H I J
541 -0.277848 1.022357 -0.950125 -2.100213 0.883638 0.821387 1.154613 0.075376 1.176242 -0.470087
440 1.089665 0.841446 -1.701004 -1.036256 -1.229357 0.345068 1.876470 -0.750067 0.080685 -1.318271
482 0.016010 0.025488 -1.189296 -1.052935 -0.623029 0.669521 1.518927 0.690019 -0.045486 -0.494186
422 -0.133358 -2.16219 1.170989 -0.942150 1.933444 -0.55118 -0.059908 -0.938672 -0.924097 -0.796185
778 0.901954 1.479360 -2.639176 -2.588845 -0.753915 -1.650621 2.727146 0.075260 1.330432 -0.941594
After conducting a feature importance analysis, I discovered that each of the 3 classes in the dataset is best predicted using a subset of the features, as opposed to the whole set. For example:
class | optimal predictors
-------+-------------------
0 | A, B, C
1 | D, E, F, G
2 | G, H, I, J
-------+-------------------
At this point, I would like to use 3 one-vs-rest classifiers as the base models, one per class, each trained on that class's best predictors, and then a StackingClassifier for the final prediction.
I have a high-level understanding of the StackingClassifier, where different base models can be trained (e.g. DT, SVC, KNN, etc.) and a meta classifier uses another model, e.g. Logistic Regression.
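For reference, a generic StackingClassifier of that kind might look like this (the estimator choices below are only an illustration, not my actual setup):
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Generic stacking: several different base models plus a logistic regression meta model
generic_stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC()),
        ('knn', KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression()
)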
In this case, however, every base model is a DT classifier; the difference is that each one should be trained only on the feature subset that is best for its class, as above. Then the stack makes the final predictions on X_test.
But I am not sure how this can be done, so I have described my goal using the synthetic data above. How can I design this to train the base models and produce a final prediction?
Solution
You can do what you describe programmatically, but I am not sure what the gain would be over a simple Random Forest, which internally does all of this (feature sub-selection, fitting, etc.).
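For comparison, such a Random Forest baseline could be as simple as the following (the hyperparameters here are only illustrative):
from sklearn.ensemble import RandomForestClassifier

# Each split in each tree considers only a random subset of the features
# (controlled by max_features), so feature sub-selection happens internally.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)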
Here is an implementation of what you have described. I have used exactly the same base and stacking models as the ones you mentioned:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
def select_columns(X, columns):
    # Helper used by FunctionTransformer: keep only the given columns
    return X[columns]
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3, n_informative=3)
df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)
feature_subsets = {
    0: ['A', 'B', 'C'],
    1: ['D', 'E', 'F', 'G'],
    2: ['G', 'H', 'I', 'J']
}
# Base model
base_dt_model = DecisionTreeClassifier(random_state=42)
# One-vs-Rest classifiers with per-class feature subsets
classifiers = []
for class_label, features in feature_subsets.items():
    model = clone(base_dt_model)
    # select the class-specific features, then fit a fresh decision tree on them
    pipeline = Pipeline([
        ('feature_selection', FunctionTransformer(select_columns, kw_args={'columns': features})),
        ('classifier', model)
    ])
    classifiers.append(('dt_class_' + str(class_label), pipeline))
# Logistic Regression as the metaclassifier
stack = StackingClassifier(estimators=classifiers, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
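As a quick sanity check (this evaluation snippet is just an illustration, not part of the pipeline itself), you can score the stacked model on the held-out set:
from sklearn.metrics import accuracy_score, classification_report

# Compare the stacked predictions against the true labels of the held-out split
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))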
Answered By - seralouk