Issue
I want to get the names of the most important features for a logistic regression model after transformation.
columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f',
                        'g', 'h', 'i', 'j', 'k', 'l', 'm',
                        'n', 'o', 'p']
columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')
I know that I can do this:
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.2, random_state=42)

x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter=5000, class_weight={1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)
importance = model.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()
But with this I get feature 1, feature 2, feature 3... and after transformation I have around 45k features.
How can I get the list of the most important features (before transformation)? I want to know which features are best for the model. I have a lot of categorical features with 100+ distinct categories, so after encoding I have more features than rows in my dataset. I want to find out which features I can exclude from my dataset and which features matter most to my model.
IMPORTANT
I have other features that are used but not transformed; because of that, I set remainder='passthrough'.
Solution
As you may already be aware, the whole idea of feature importance is a bit tricky in the case of LogisticRegression (a quick coefficient-magnitude proxy is sketched right after the links below). You can read more about it in these posts:
- How to find the importance of the features for a logistic regression model?
- Feature Importance in Logistic Regression for Machine Learning Interpretability
- How to Calculate Feature Importance With Python
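That said, if you do want a quick proxy for LogisticRegression itself, the most common (and debatable) choice is the coefficient magnitude, paired with the output names of the fitted ColumnTransformer. A minimal sketch, assuming scikit-learn >= 1.0 (where ColumnTransformer.get_feature_names_out() is available) and the fitted model and transformerVectoriser from your question:

import numpy as np

# |coefficient| as a rough importance proxy; values are only comparable
# across features if the inputs are on similar scales.
coef_importance = np.abs(model.coef_[0])

# One name per transformed column, prefixed with the transformer label,
# e.g. 'Vector Cat__a_x1', 'Normalizer__aa', 'remainder__some-col'.
names = transformerVectoriser.get_feature_names_out()

# Print the top 20 transformed features by coefficient magnitude.
for idx in np.argsort(coef_importance)[::-1][:20]:
    print(f'{names[idx]}: {coef_importance[idx]:.5f}')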
Beyond that, I personally found these and other similar posts inconclusive, so I am going to avoid that part in my answer and address your main question about feature splitting and aggregating the feature importances (assuming they are available for the split features) using a RandomForestClassifier. I am also assuming that the importance of a parent feature is the sum of its child features' importances; for example, if species is one-hot encoded into three columns with importances 0.10, 0.05, and 0.02, then species itself is assigned 0.17.
Under these assumptions, the code below recovers the importances of the original features. I am using the Palmer Archipelago (Antarctica) penguin data for the illustration.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer, OneHotEncoder

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()
# to comply with the assumption later that column names don't contain '_'
df.columns = [c.replace('_', '-') for c in df.columns]
X = df.iloc[:, :-1]
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)
pd.options.display.width = 0
print(X.head())
| species | island | culmen-length-mm | culmen-depth-mm | flipper-length-mm | body-mass-g |
|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 |
| Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 |
| Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 |
| Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 |
| Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 |
columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)
importance = model.feature_importances_
# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]
# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]
# remainder cols require a quick lookup, as no transformer object exists for this case
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]
# storing them in a df for easy manipulation
imp_df = pd.DataFrame({
    'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
    'importance': list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp),
})
# aggregating, assuming that column names don't contain _ just to keep it simple
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()
print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')
Output: the aggregated importance table, with one row per original column; since these are RandomForest importances, they sum to 1.0.
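To pull out the ranked list you were actually after, you can sort the aggregated frame; a minimal follow-up on the imp_agg computed above:

# Most important original features first; each value is that column's share
# of the total importance (RandomForest importances sum to 1.0).
print(imp_agg.sort_values(by='importance', ascending=False))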
Answered By - jdsurya