Sunday, November 7, 2021

[FIXED] How to delete rows based on threshold, but organized by another column?

Tags: conditional-statements, dataframe, pandas, python

Issue

I am trying to delete rows from my dataframe, one whole category at a time, based on whether any value in that category reaches a threshold. I feel this should be a simple "if" conditional statement, but I am not sure whether it really is that simple or something much more complex. I am using Python and pandas. The example below should communicate the problem better.

I have the following dataframe:

Date              Category      Value
---------------------------------------
2015-06-02               1          2
2015-06-03               1          9
2015-06-04               1          2
2015-06-05               2          2
2015-06-06               2          8
2015-06-07               2         11
2015-06-08               2          2
2015-06-09               4          2
2015-06-10               4          5
2015-06-11               4         12
2015-06-12               4          2
2015-06-13               6          2
2015-06-14               6          8 
2015-06-15               6          2 
2015-06-16               8          2
2015-06-17               8          6
2015-06-18               8         10
2015-06-19               8          2

As you can see, the rows are grouped by an ID in the "Category" column, so this example has 5 categories: "1", "2", "4", "6" and "8". It does not matter that the IDs do not increase by 1, so long as the categories are kept distinct. You will also see a pattern: the data is in chronological order and, within each category, the value starts at 2, takes some number of other values, and ends at 2 again. So each category has values "in between" two values of 2.

What I want to do is remove every category in which no value between the two 2's is >= 10. Here, that means removing Category 1 (it only has a 9 between the 2's) and Category 6 (it only has an 8 between the 2's). There can be any number of values between the 2's, but at least one of them must be >= 10 for the category to be kept. So I would want to produce the following dataframe:

Date              Category      Value
---------------------------------------
2015-06-05               2          2
2015-06-06               2          8
2015-06-07               2         11
2015-06-08               2          2
2015-06-09               4          2
2015-06-10               4          5
2015-06-11               4         12
2015-06-12               4          2
2015-06-16               8          2
2015-06-17               8          6
2015-06-18               8         10
2015-06-19               8          2

How can this be done in Python? I would think I need a conditional statement along the lines of "if value in df['Value'] < 10, then remove", but I am not sure how to organize this by the "Category" column as described above.

Solution

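Since each group starts and ends with 2, which is below the threshold, any row matching df['Value'].ge(10) must be one of the rows in the middle. So you only need to apply groupby().transform('any') to that condition, which marks every row of a category as True when at least one value in that category is >= 10: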

mask = df['Value'].ge(10).groupby(df['Category']).transform('any')

df[mask]

Output:

          Date  Category  Value
3   2015-06-05         2      2
4   2015-06-06         2      8
5   2015-06-07         2     11
6   2015-06-08         2      2
7   2015-06-09         4      2
8   2015-06-10         4      5
9   2015-06-11         4     12
10  2015-06-12         4      2
14  2015-06-16         8      2
15  2015-06-17         8      6
16  2015-06-18         8     10
17  2015-06-19         8      2
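
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and splits the mask into its two steps. The names reaches_threshold, kept and kept_alt are only illustrative and do not come from the original answer. The groupby().filter() line at the end is an alternative that should give the same rows for this data. It is included for comparison only and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (18 consecutive days).
df = pd.DataFrame({
    "Date": pd.date_range("2015-06-02", periods=18, freq="D"),
    "Category": [1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 8, 8, 8, 8],
    "Value": [2, 9, 2, 2, 8, 11, 2, 2, 5, 12, 2, 2, 8, 2, 2, 6, 10, 2],
})

# Step 1: row-level test, True only where the value itself is >= 10.
reaches_threshold = df["Value"].ge(10)

# Step 2: group-level test, broadcast back to every row of the category:
# True for all rows of a category if any row in it passed step 1.
mask = reaches_threshold.groupby(df["Category"]).transform("any")

# Keep only the rows whose category passed.
kept = df[mask]
print(kept)

# Alternative (an assumption, not from the answer): filter whole groups.
kept_alt = df.groupby("Category").filter(lambda g: g["Value"].ge(10).any())
```

The transform version builds a boolean mask without calling a Python function per group, which may matter on larger frames. The filter version reads closer to the wording of the question.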

Answered By - Quang Hoang

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
