Issue
I have used GroupShuffleSplit in the past and it has worked OK. But now I am trying to split based on a column, and it is producing overlap between the test and train data. This is what I am running:
val_inds, test_inds = next(
    GroupShuffleSplit(test_size=0.5, n_splits=2).split(df, groups=df['cl_uid'].values)
)
df_val = df[df.index.isin(val_inds)]
df_test = df[df.index.isin(test_inds)]
# this value is not zero
len(set(df_val.cl_uid).intersection(set(df_test.cl_uid)))
Any idea what could be going on?
I am using sklearn version 0.24.1 and Python version 3.6.
Solution
GroupShuffleSplit.split() yields positional array indices, not index labels, so if you want to split your DataFrame you should use .iloc to filter.
df_val = df.iloc[val_inds]
df_test = df.iloc[test_inds]
If you mistakenly filter on the index instead, you are assuming that the DataFrame has a non-duplicated RangeIndex that begins at 0, so that labels and positions coincide. If that is not the case, the filter selects the wrong rows.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
# DataFrame with a non-RangeIndex
df = pd.DataFrame({'clust_id': [1,1,1,2,2,2,2]}, index=[1,2,1,2,1,2,3])
# No random_state is set, so which group lands in which split can vary between runs
val_inds, test_inds = next(
    GroupShuffleSplit(test_size=0.5, n_splits=2).split(df, groups=df['clust_id'])
)
Correct splitting
df_val = df.iloc[val_inds]
#    clust_id
# 2         2
# 1         2
# 2         2
# 3         2
df_test = df.iloc[test_inds]
#    clust_id
# 1         1
# 2         1
# 1         1
Incorrect splitting, which confuses index labels with array positions
df[df.index.isin(val_inds)]
#    clust_id
# 3         2
df[df.index.isin(test_inds)]
#    clust_id
# 1         1
# 2         1
# 1         1
# 2         2
# 1         2
# 2         2
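To confirm the difference programmatically, the check from the question can be run against both selection styles. The following is only a minimal sketch on the same toy DataFrame: the helper name shared_groups and the random_state value are illustrative assumptions, not part of the original code.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Same toy frame as in the example: duplicated index labels that do not start at 0.
df = pd.DataFrame({'clust_id': [1, 1, 1, 2, 2, 2, 2]}, index=[1, 2, 1, 2, 1, 2, 3])

# random_state is an illustrative choice that makes the run repeatable.
splitter = GroupShuffleSplit(test_size=0.5, n_splits=1, random_state=0)
val_inds, test_inds = next(splitter.split(df, groups=df['clust_id']))


def shared_groups(a, b, col='clust_id'):
    """Hypothetical helper: group ids that appear in both frames."""
    return set(a[col]) & set(b[col])


# Positional selection: the two frames share no group.
pos_overlap = shared_groups(df.iloc[val_inds], df.iloc[test_inds])

# Label-based selection: positions are misread as index labels.
label_overlap = shared_groups(
    df[df.index.isin(val_inds)], df[df.index.isin(test_inds)]
)

print(pos_overlap)    # empty set
print(label_overlap)  # not empty for this frame
```

An empty set for the .iloc version and a non-empty set for the index-based version reproduces the overlap reported in the question.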
Answered By - ALollz