Issue
I have the following DataFrame:
Time: Value:
1000 4.6 - Keep because +ve
2000 3.2 - Keep because +ve
3000 -1.1 - Remove because -ve AND in consecutive group of negatives of size < 3
4000 -0.4 - Remove because -ve AND in consecutive group of negatives of size < 3
5000 0.8 - Keep because +ve
6000 -1.5 - Keep because -ve AND in consecutive group of negatives of size >= 3
7000 -2.1 - Keep because -ve AND in consecutive group of negatives of size >= 3
8000 -3.4 - Keep because -ve AND in consecutive group of negatives of size >= 3
9000 -1.5 - Keep because -ve AND in consecutive group of negatives of size >= 3
10000 -0.3 - Keep because -ve AND in consecutive group of negatives of size >= 3
11000 1.6 - Keep because +ve
12000 2.8 - Keep because +ve
13000 4.0 - Keep because +ve
I want to produce the following DataFrame from it: (Removing small groups of negative values where the group size is less than n, n=3 in this eg.)
Time: Value:
1000 4.6
2000 3.2
5000 0.8
6000 -1.5
7000 -2.1
8000 -3.4
9000 -1.5
10000 -0.3
11000 1.6
12000 2.8
13000 4.0
In the absence of a clever Pandas solution, I am planning to write a element-wise loop which iterates through each row marking consecutive negative values (in a new column), and immediately removing those marked rows when the -ve sequence is interrupted by a +ve value. (Deletion will not occur if the -ve sequence reaches the minimum size of n. The example above shows n=3).
Having deleted the marked rows, I will carry on from where I left off, until the end of the original frame is reached.
I know the proposed solution is not elegant in the Pandas world(!), but cannot figure out how a purist Pandas solution would work. Maybe something using groups or shift?
Solution
group the negative values together and get their number, use this to form a mask for boolean indexing:
n = 3
# non-negative values
m1 = df['Value'].ge(0)
# count of negative values per group of successive ones
m2 = (~m).groupby(m1.cumsum()).transform('sum').ge(n)
out = df[m1|m2]
output:
Time Value
0 1000 4.6
1 2000 3.2
4 5000 0.8
5 6000 -1.5
6 7000 -2.1
7 8000 -3.4
8 9000 -1.5
9 10000 -0.3
10 11000 1.6
11 12000 2.8
12 13000 4.0
intermediates:
Time Value m1 m2 m1|m2
0 1000 4.6 True False True
1 2000 3.2 True False True
2 3000 -1.1 False False False
3 4000 -0.4 False False False
4 5000 0.8 True True True
5 6000 -1.5 False True True
6 7000 -2.1 False True True
7 8000 -3.4 False True True
8 9000 -1.5 False True True
9 10000 -0.3 False True True
10 11000 1.6 True False True
11 12000 2.8 True False True
12 13000 4.0 True False True
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.