Friday, June 3, 2022

[FIXED] How to give more importance to some features in sklearn Isolation Forest

Tags: isolation-forest, python, scikit-learn

Issue

I am using the sklearn Isolation Forest for an anomaly detection task. An isolation forest consists of iTrees. As this paper describes, each node of an iTree is split as follows: a feature is selected uniformly at random, and the split is made at a random value of that feature.

But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.

How can I do that? From the source code, it seems I would have to change the function _generate_bagging_indices in _bagging.py, but I am not sure.

Solution

You can achieve this without changing the source code by duplicating the columns of the features you want to weight more heavily. A feature that appears twice in the input is twice as likely to be drawn at each split as a feature that appears once, which in practice doubles its weight relative to the other features.
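To make the effect of duplication concrete, here is a minimal sketch of the arithmetic. It assumes every column is equally likely to be drawn at a split, and the helper name selection_probabilities is hypothetical, not part of scikit-learn.

```python
def selection_probabilities(columns, extra_copies):
    """Chance of drawing each original feature at a split.

    Assumes every column (original or copy) is equally likely to be drawn.
    `extra_copies` maps a column name to the number of duplicates added.
    """
    counts = {c: 1 + extra_copies.get(c, 0) for c in columns}
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


iris_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Ten extra copies of sepal length: 11 of the 14 columns carry that feature.
probs = selection_probabilities(iris_columns, {"sepal length (cm)": 10})
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

With ten extra copies, sepal length accounts for 11 of the 14 columns, while each of the other features accounts for 1 of 14.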

In addition, you can reduce the number of features that each tree uses, which is controlled by the max_features argument. The default value of 1.0 means that every feature is available to every tree. With a smaller value, each tree is trained on a random subset of the columns, so more trees end up being built without the features that appear only once in your input.
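As a rough check of what max_features does, the sketch below fits a forest on the four iris columns and inspects the fitted estimators_features_ attribute, which holds the column indices drawn for each tree. The values max_features=0.5 and random_state=0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X = load_iris().data  # 150 rows, 4 columns

# With max_features=0.5, each tree is given 2 of the 4 columns.
model = IsolationForest(max_features=0.5, random_state=0).fit(X)

# Number of columns drawn for each tree.
per_tree = [len(features) for features in model.estimators_features_]

# How many trees received each column.
usage = np.bincount(np.concatenate(model.estimators_features_), minlength=X.shape[1])

print("columns per tree:", set(per_tree))
print("trees using each column:", usage)
```

Each column ends up in only part of the forest, so a feature that occupies several columns has a better chance of being present in any given tree.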

Illustration

Load Data

from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)

Default settings

# Fit on the four original features; predict returns 1 for inliers and -1 for outliers
IF = IsolationForest()
IF.fit(df)
preds = IF.predict(df)

plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()

Weighted Settings

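df1 = df.copy()
weight_feature = 10  # number of extra copies of the feature to add
for i in range(weight_feature):
    df1["duplicated_" + str(i)] = df1["sepal length (cm)"]

# Each tree is given only 30% of the 14 columns, 11 of which hold sepal length
IF1 = IsolationForest(max_features=0.3)
IF1.fit(df1)
preds1 = IF1.predict(df1)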

plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()

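Comparing the two plots, the weighted model relies more heavily on the x-axis (sepal length) to decide which points are outliers. Because the forest is built randomly, the exact set of flagged points changes from run to run.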
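The visual impression can also be checked numerically. The sketch below rebuilds the weighted data, fits the forest, and counts how many tree splits fall on a column holding sepal length. It relies on the fitted estimators_ and estimators_features_ attributes and on each tree's tree_.feature array; random_state=0 is an assumption added only for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Same weighting as above: ten extra copies of sepal length.
df1 = df.copy()
for i in range(10):
    df1["duplicated_" + str(i)] = df1["sepal length (cm)"]

model = IsolationForest(max_features=0.3, random_state=0).fit(df1)

split_counts = np.zeros(df1.shape[1], dtype=np.int64)
for tree, features in zip(model.estimators_, model.estimators_features_):
    local = tree.tree_.feature            # negative entries mark leaf nodes
    used = features[local[local >= 0]]    # map back to df1 column positions
    split_counts += np.bincount(used, minlength=df1.shape[1])

# Columns that carry sepal length: the original plus its copies.
is_sepal_length = np.array(
    [c == "sepal length (cm)" or c.startswith("duplicated_") for c in df1.columns]
)
sepal_length_share = split_counts[is_sepal_length].sum() / split_counts.sum()
print(f"share of splits on sepal length: {sepal_length_share:.2f}")
```

A share well above one half would indicate that most splits are made on sepal length, in line with what the scatter plot suggests.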

Answered By - MaximeKan

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
