Issue
I am using sklearn isolation forest for an anomaly detection task. Isolation forest consists of iTrees. As this paper describes, the nodes of the iTrees are split in the following way: We select any feature (uniformly) randomly and perform a split on a random value of that feature.
But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.
How can I do that? From the source code it seems I have to change the function _generate_bagging_indices
in _bagging.py
, but not sure.
Solution
You can achieve this without changing the source code. Instead, you can tweak your input data by duplicating the features you wish to increase the weight for. If you have a feature appearing twice, the trees will use it twice to split your data, which in practice will mean the same as having doubled the weight of the feature.
In addition to this, you can also choose to reduce the amount of features used by your isolation forest in each tree. This is controlled by the argument max_features
. The default value of 1.0 ensures that every feature will be used for each tree. By reducing it, more trees will be trained without the less frequent features in your input.
Illustration
Load Data
from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)
Default settings
IF = IsolationForest()
IF.fit(df)
preds = IF.predict(df)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
Weighted Settings
df1 = df.copy()
weight_feature = 10
for i in range(weight_feature):
df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
IF1 = IsolationForest(max_features=0.3)
IF1.fit(df1)
preds1 = IF1.predict(df1)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
As you can see visually, the second option has used the X-axis more intensively to determine which are the outliers.
Answered By - MaximeKan
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.