Issue
I've looked at the Sklearn stratified sampling docs as well as the pandas docs and also Stratified samples from Pandas and sklearn stratified sampling based on a column but they do not address this issue.
Im looking for a fast pandas/sklearn/numpy way to generate stratified samples of size n from a dataset. However, for rows with less than the specified sampling number, it should take all of the entries.
Concrete example:
Thank you! :)
Solution
Use min
when passing the number to sample. Consider the dataframe df
df = pd.DataFrame(dict(
A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
B=range(10)
))
df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))
A B
1 1 1
2 1 2
3 2 3
6 2 6
7 3 7
9 4 9
8 4 8
Answered By - piRSquared
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.