Issue
I have a dataframe of few hundreds rows , that can be grouped to ids as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV , but with a custom CV that will assure that all the rows from the same ID will always be on the same set. So either all the rows if a are in the test set , or all of them are in the train set - and so for all the different IDs.
I want to have 5 folds - so 80% of the ids will go to the train and 20% to the test. I understand that it can't guarentee that all folds will have the exact same amount of rows - since one ID might have more rows than the other.
What is the best way to do so?
Solution
As stated, you can provide cv
with an iterator. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can put the result within GridSearchCV()
for the cv
parameter.
Answered By - thomaskolasa
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.