Issue
I have an unbalanced dataset and I want to perform random subsampling on the majority class, where each subsample is the same size as the minority class. I think this is already implemented in Weka and Matlab; is there an equivalent in sklearn?
Solution
Say your data looks like something generated from this code:
import numpy as np
x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])
(Only one fifth of y is 1, which makes 1 the minority class.)
To find the size of the minority class, do:
>>> np.sum(y == 1)
20
To find the subset that consists of the majority class, do:
majority_x, majority_y = x[y == 0, :], y[y == 0]
To draw a random subset of size 20 without replacement (so that no majority row is picked twice), do:
inds = np.random.choice(majority_x.shape[0], 20, replace=False)
followed by
majority_x[inds, :]
and
majority_y[inds]
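Putting the steps together, here is a minimal sketch of a helper that returns a balanced dataset. The function name undersample_majority and its majority_label and seed parameters are made up for illustration; they are not part of sklearn or NumPy. It assumes binary labels, a majority class that is at least as large as the minority class, and NumPy 1.17 or later for np.random.default_rng.

```python
import numpy as np


def undersample_majority(x, y, majority_label=0, seed=None):
    """Hypothetical helper: randomly drop majority rows until classes match."""
    rng = np.random.default_rng(seed)

    # Row indices of each class (assumes exactly two classes).
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)

    # Sample as many majority rows as there are minority rows,
    # without replacement so that no row appears twice.
    kept = rng.choice(majority_idx, size=minority_idx.size, replace=False)

    # Combine both classes and shuffle so they are not grouped together.
    idx = np.concatenate([kept, minority_idx])
    rng.shuffle(idx)
    return x[idx], y[idx]


# Same toy data as above: 80 rows of class 0, 20 rows of class 1.
x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])

x_bal, y_bal = undersample_majority(x, y, seed=0)
print(x_bal.shape, np.bincount(y_bal))  # (40, 3) [20 20]
```

Calling the helper several times with different seed values gives several independent subsamples of the majority class, each the same size as the minority class.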
Answered By - Ami Tavory