Issue
I came across the following statement when trying to find the difference between train_test_split and StratifiedShuffleSplit:

    When stratify is not None, train_test_split uses StratifiedShuffleSplit internally.

I was just wondering why StratifiedShuffleSplit from sklearn.model_selection is needed when we can use the stratify argument available in train_test_split.
Solution
Mainly, it is done for the sake of re-usability. Rather than duplicating the code already implemented for StratifiedShuffleSplit, train_test_split just calls that class. For the same reason, when stratify=None, it uses the model_selection.ShuffleSplit class (see source code).
Please note that duplicating code is considered a bad practice: it is assumed to inflate maintenance costs, and it is also defect-prone, because inconsistent changes to code duplicates can lead to unexpected behavior. Here is a reference if you'd like to learn more.
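To make the relationship concrete, here is a minimal sketch (assuming scikit-learn is installed) showing that train_test_split with the stratify argument and a hand-rolled StratifiedShuffleSplit produce test sets with the same class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)  # imbalanced labels: 70% / 30%

# Option 1: the convenience function with the stratify argument
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Option 2: the class that train_test_split wraps internally
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

# Both 3-sample test sets preserve the 70/30 class ratio (2 of class 0, 1 of class 1)
print(np.bincount(y_te))
print(np.bincount(y[test_idx]))
```

The exact rows selected may differ between the two calls (they consume the random state differently), but the per-class counts in each split are the same.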
Besides, although they perform the same task, they cannot always be used in the same contexts. For example, train_test_split cannot be passed as the cv argument of sklearn.model_selection.RandomizedSearchCV or sklearn.model_selection.GridSearchCV, while StratifiedShuffleSplit can. The reason is that the former is not "an iterable yielding (train, test) splits as arrays of indices", whereas the latter has a split method that yields (train, test) splits as arrays of indices.
More info here (see the cv parameter).
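A short sketch of that last point, assuming scikit-learn is installed: a StratifiedShuffleSplit instance can be handed directly to GridSearchCV as the cv argument, because its split method yields index arrays, while a one-off function call like train_test_split cannot:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = np.array([0] * 40 + [1] * 20)  # imbalanced labels

# Each of the 5 shuffled splits keeps the 2:1 class ratio in train and test
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,  # a splitter object with a split() method, not a plain function
)
search.fit(X, y)
print(search.best_params_)
```

Trying to pass train_test_split itself (or its return value) as cv would fail, since GridSearchCV expects either an integer, a splitter object, or an iterable of (train, test) index arrays.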
Answered By - s.dallapalma