Issue
I have a large dataset for multi-class classification (3 classes). I want to take a sub-sample of the data, i.e. 200 records belonging to each class, and then split that sub-sample.
Say the 3 classes are cat, dog, and cow. I want to apply a split on a subset of the data in which 200 records are selected from the large dataset for each of the classes cat, dog, and cow, to train the ML model.
- cat - 200 observations
- dog - 200 observations
- cow - 200 observations
This is the code line for splitting the data:
# split data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
How can I select X and y such that they contain 200 records for each class?
Solution
Take an equal number of samples from each class:
df_cats = df[df['class'] == 'cat'][:200]
df_dogs = df[df['class'] == 'dog'][:200]
df_cows = df[df['class'] == 'cow'][:200]
Then concatenate the dataframes:
df_new = pd.concat([df_cats, df_dogs, df_cows])
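Note that slicing with [:200] keeps the first 200 rows of each class in their existing order. If a random sample per class is preferred, one option is to group by the class column and sample within each group. The sketch below is only an illustration: it assumes pandas 1.1 or later (for groupby(...).sample), and the toy DataFrame and its feature column are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the large dataset:
# 300 rows for each of the three classes.
df = pd.DataFrame({
    "feature": range(900),
    "class": ["cat", "dog", "cow"] * 300,
})

# Draw 200 random rows from each class; random_state makes the draw repeatable.
df_new = df.groupby("class").sample(n=200, random_state=42)

# Check how many rows were kept per class.
counts = df_new["class"].value_counts()
print(counts)
```

This requires every class to have at least 200 rows; otherwise sample raises an error unless replace=True is passed.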
From there, split df_new into X and y.
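A minimal end-to-end sketch of that last step, assuming the label column is named class as in the snippets above. The toy DataFrame and the feature_a / feature_b columns are hypothetical; with real data, X would be whatever feature columns the dataset has.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the large dataset.
df = pd.DataFrame({
    "feature_a": range(900),
    "feature_b": [i % 7 for i in range(900)],
    "class": ["cat", "dog", "cow"] * 300,
})

# Keep the first 200 rows of each class and stack them into one DataFrame.
df_new = pd.concat(
    [df[df["class"] == label][:200] for label in ["cat", "dog", "cow"]]
)

# Features are every column except the label; the label is the target.
X = df_new.drop(columns="class")
y = df_new["class"]

# Stratified 70/30 split, using the same arguments as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Because stratify=y is passed, the class proportions of the 600-row subset are preserved in both the training and test sets.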
Answered By - imdevskp