Issue
I have a dataset of around 10k rows and I am splitting it in an 80:20 ratio with sklearn's train_test_split. However, I fail to understand why the output sizes don't add up to the original dataset. For example, here's the size of my dataset from df.shape: (9538, 15).
Now if I put this into train_test_split I get something like:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_fake, test_size=0.2, random_state=0)
train, val = train_test_split(df_fake, test_size=0.25, random_state=0)
print('Train-',train.shape)
print('Val-',val.shape)
print('Test-',test.shape)
The outputs:
Train- (7153, 15)
Val- (2385, 15)
Test- (1908, 15)
So the test set plus the validation set comes to 4293, and adding that to the train set gives 11446, even though I only have about 9.5k rows of data. Am I doing something wrong?
Solution
Your issue is that you're splitting df_fake twice and overwriting the train variable. Sklearn's train_test_split function splits the data into two parts, so the train and val sets from your second call add up to the correct number: 7153 + 2385 = 9538. The test set comes from a separate split of the same full dataset, which is why the three counts together exceed 9538 (and why test overlaps with train and val). If you want to split three ways, split once and then split one of the resulting datasets again. For an 80-10-10 split, first cut out 20% and then cut that 20% in half:
from sklearn.model_selection import train_test_split
train, tv = train_test_split(df_fake, test_size=0.2, random_state=0)  # 80% train, 20% held out
test, val = train_test_split(tv, test_size=0.5, random_state=0)       # split the 20% in half: 10% test, 10% val
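To see that the three partitions are disjoint and cover every row, here is a runnable sketch of the same two-step split. Since the original df_fake isn't available, it fabricates a DataFrame with the same shape (9538, 15):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for df_fake: same shape as the question's dataset.
df = pd.DataFrame(np.random.rand(9538, 15))

# Step 1: carve off 20% of the rows.
train, tv = train_test_split(df, test_size=0.2, random_state=0)
# Step 2: split that 20% in half to get 10% test and 10% val.
test, val = train_test_split(tv, test_size=0.5, random_state=0)

print('Train-', train.shape)
print('Val-', val.shape)
print('Test-', test.shape)

# The three parts now sum to the original row count.
assert len(train) + len(val) + len(test) == len(df)
```

Note that sklearn rounds the test fraction up, so 20% of 9538 rows is 1908, leaving 7630 for train and 954 each for val and test.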
Answered By - mschoder