Issue
I want to get the names of the most important features for a logistic regression model after transformation.
columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f',
                        'g', 'h', 'i', 'j', 'k', 'l', 'm',
                        'n', 'o', 'p']
columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')
I know that I can do this:
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.2, random_state=42)

x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter=5000, class_weight={1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)
importance = model.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance
pyplot.bar(range(len(importance)), importance)
pyplot.show()
But with this I get feature 1, feature 2, feature 3... and after transformation I have around 45k features.
How can I get the list of the most important features (before transformation)? I want to know which features are best for the model. I have a lot of categorical features with 100+ distinct categories, so after encoding I have more features than rows in my dataset. I want to find out which features I can exclude from my dataset and which features matter most to my model.
IMPORTANT
I have other features that are used but not transformed; because of that, I set remainder='passthrough'.
Solution
As you may already be aware, the whole idea of feature importance is a bit tricky in the case of LogisticRegression (a quick coefficient-magnitude proxy is sketched right after the links below). You can read more about it in these posts:
- How to find the importance of the features for a logistic regression model?
- Feature Importance in Logistic Regression for Machine Learning Interpretability
- How to Calculate Feature Importance With Python
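That said, if you do want a quick proxy for LogisticRegression itself, the most common (and debatable) choice is the coefficient magnitude, paired with the output names of the fitted ColumnTransformer. A minimal sketch, assuming scikit-learn >= 1.0 (where ColumnTransformer.get_feature_names_out() is available) and the fitted model and transformerVectoriser from your question:

import numpy as np

# |coefficient| as a rough importance proxy; values are only comparable
# across features if the inputs are on similar scales.
coef_importance = np.abs(model.coef_[0])

# One name per transformed column, prefixed with the transformer label,
# e.g. 'Vector Cat__a_x1', 'Normalizer__aa', 'remainder__some-col'.
names = transformerVectoriser.get_feature_names_out()

# Print the top 20 transformed features by coefficient magnitude.
for idx in np.argsort(coef_importance)[::-1][:20]:
    print(f'{names[idx]}: {coef_importance[idx]:.5f}')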
Beyond that, I personally found these and other similar posts inconclusive, so I am going to avoid that part in my answer and address your main question about feature splitting and aggregating the feature importances (assuming they are available for the split features) using a RandomForestClassifier. I am also assuming that the importance of a parent feature is the sum of its child features' importances; for example, if species is one-hot encoded into three columns with importances 0.10, 0.05, and 0.02, then species itself is assigned 0.17.
Under these assumptions, the code below recovers the importances of the original features. I am using the Palmer Archipelago (Antarctica) penguin data for the illustration.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer, OneHotEncoder

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()
# to comply with the assumption later that column names don't contain '_'
df.columns = [c.replace('_', '-') for c in df.columns]
X = df.iloc[:, :-1]
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)
pd.options.display.width = 0
print(X.head())
| species | island | culmen-length-mm | culmen-depth-mm | flipper-length-mm | body-mass-g |
|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 |
| Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 |
| Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 |
| Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 |
| Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 |
columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)
importance = model.feature_importances_
# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]
# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]
# remainder cols require a quick lookup, as no transformer object exists for this case
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]
# storing them in a df for easy manipulation
imp_df = pd.DataFrame({
    'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
    'importance': list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp),
})
# aggregating, assuming that column names don't contain _ just to keep it simple
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()
print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')
Output: the aggregated importance table, with one row per original column; since these are RandomForest importances, they sum to 1.0.
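To pull out the ranked list you were actually after, you can sort the aggregated frame; a minimal follow-up on the imp_agg computed above:

# Most important original features first; each value is that column's share
# of the total importance (RandomForest importances sum to 1.0).
print(imp_agg.sort_values(by='importance', ascending=False))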
Answered By - jdsurya