Monday, December 13, 2021

[FIXED] Specific complicated filtering of Pandas dataframe rows

Tags: dataframe, for-loop, jupyter-notebook, pandas, python

Issue

The data has many columns but the ones in question are as follows:

MR        Version
GB1       Package
GB5       Package
GB9       3.5
GB5       3.3
GB1       Package
GB9       1.5
GB359     9.1
GB1       Package
GB99      5.5
...

MR (model) names repeat, and so does the value Package in the Version column. I need to:

first select all rows where Version == Package,
then take their MR model name, for instance GB5,
then find all other rows with the same MR model name, and
finally check whether any of those other rows (with the same MR model name) have a Version value other than Package (!= Package). Models that do should be classified as good, and models that do not should be classified as bad.

For instance, in the example data above, model GB5 has both Package and non-Package cells, hence it is good, whereas model GB1 has only Package values in the Version column, hence it is bad.

MRs that have only numeric values in the Version column, such as GB9, are not relevant to this task.
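As a minimal sketch of the steps described above, the visible sample rows can be loaded into a DataFrame and the four steps written with boolean masks. It assumes Version is stored as strings and leaves out the rows elided by "..."; the names df, package_mrs, candidates, good and bad are illustrative only.

```python
import pandas as pd

# Only the rows visible in the example; Version is assumed to be stored as strings
df = pd.DataFrame({
    "MR": ["GB1", "GB5", "GB9", "GB5", "GB1", "GB9", "GB359", "GB1", "GB99"],
    "Version": ["Package", "Package", "3.5", "3.3", "Package",
                "1.5", "9.1", "Package", "5.5"],
})

# Steps 1 and 2: MR names that appear in at least one Package row
package_mrs = set(df.loc[df["Version"] == "Package", "MR"])

# Step 3: all rows belonging to those MR names, wherever they sit in the frame
candidates = df[df["MR"].isin(package_mrs)]

# Step 4: MR names that also have a non-Package row are good, the rest are bad
good = set(candidates.loc[candidates["Version"] != "Package", "MR"])
bad = package_mrs - good

print(sorted(good))  # ['GB5']
print(sorted(bad))   # ['GB1']
```

Because the rows are matched by MR name rather than by position, this does not depend on the entries being adjacent.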

Usually these entries are next to each other, and there are usually two of every model, so I wrote the loop below, which handles that case by selecting every two rows from the dataframe. However, I have now discovered that in some cases the entries are not adjacent, so I need a better solution, which escapes me. Any help is greatly appreciated. Thank you all. In my code below, MR is replaced by Author, but that does not matter.

good_aut = []
bad_aut = []
for i, g in merged_df.groupby(merged_df.index // 2): # takes every two rows
    if g.iloc[0]['Version'] == 'Package':            # if row 1 is a package citation
        if g.iloc[0]['Author'] == g.iloc[1]['Author']: # check if row 1 and 2 authors match
            if g.iloc[1]['Version'] != 'Package':       # finally check if row 2 citation is not package, hence it is GAP citation
                print(g)
                good_aut.append(g.iloc[0]['Author']) # if all conditions are met we add this author to the good list, once for every occurrence
            else:
                bad_aut.append(g.iloc[0]['Author'])
        else:
            bad_aut.append(g.iloc[0]['Author'])

Solution

The question is not entirely clear: do you expect Package to be present in addition to other values?

if yes

You can groupby MR and check if Package is present together with other values:

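def good_or_bad(s):
    # s holds the Version values of one MR group
    s = set(s)
    # good only if Package appears together with at least one other value
    if 'Package' in s and len(s.difference(['Package'])) > 0:
        return 'good'
    return 'bad'

df.groupby('MR')['Version'].apply(good_or_bad)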

output:

MR
GB1       bad
GB359     bad
GB5      good
GB9       bad
GB99      bad
Name: Version, dtype: object
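The same "Package together with other values" rule can also be expressed without a Python-level function, by aggregating a boolean mask per MR. This is an alternative sketch, not part of the original answer. It assumes the sample rows shown in the question with Version stored as strings; is_package, flags and labels are illustrative names.

```python
import pandas as pd

# Sample rows from the question (elided rows omitted); Version assumed to be strings
df = pd.DataFrame({
    "MR": ["GB1", "GB5", "GB9", "GB5", "GB1", "GB9", "GB359", "GB1", "GB99"],
    "Version": ["Package", "Package", "3.5", "3.3", "Package",
                "1.5", "9.1", "Package", "5.5"],
})

# True for rows whose Version is exactly 'Package'
is_package = df["Version"].eq("Package")

# Per-MR flags: does the group contain a Package row / a non-Package row?
flags = pd.DataFrame({
    "has_package": is_package.groupby(df["MR"]).any(),
    "has_other": (~is_package).groupby(df["MR"]).any(),
})

# good only when both kinds of rows are present
labels = (flags["has_package"] & flags["has_other"]).map({True: "good", False: "bad"})
print(labels)
```

Keeping the two flags as separate columns makes it easy to derive other classifications from the same intermediate result.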

if no

You can groupby MR and check if values other than Package are present:

(df.groupby('MR')['Version']
 .apply(lambda s: len(set(s).difference(['Package']))>0)
 .map({True: 'good', False: 'bad'})
)

output:

MR
GB1       bad
GB359    good
GB5      good
GB9      good
GB99     good
Name: Version, dtype: object

I want all three possibilities

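def good_or_bad(s):
    s = set(s)
    if len(s.difference(['Package'])) > 0:
        # at least one non-Package value is present
        if 'Package' in s:
            return 'good'   # Package plus other values
        return 'other'      # only non-Package values
    return 'bad'            # only Package

df.groupby('MR')['Version'].apply(good_or_bad)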

output:

MR
GB1        bad
GB359    other
GB5       good
GB9      other
GB99     other
Name: Version, dtype: object
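To get back to the good and bad lists from the question, the per-MR labels can be filtered by value, or mapped onto the original rows. The following is a sketch under the same assumptions as before (sample rows only, Version stored as strings); the column name status and the list names good_mrs and bad_mrs are hypothetical.

```python
import pandas as pd

# Sample rows from the question (elided rows omitted); Version assumed to be strings
df = pd.DataFrame({
    "MR": ["GB1", "GB5", "GB9", "GB5", "GB1", "GB9", "GB359", "GB1", "GB99"],
    "Version": ["Package", "Package", "3.5", "3.3", "Package",
                "1.5", "9.1", "Package", "5.5"],
})

def good_or_bad(s):
    s = set(s)
    if len(s.difference(["Package"])) > 0:
        if "Package" in s:
            return "good"
        return "other"
    return "bad"

# One label per MR, indexed by MR name
labels = df.groupby("MR")["Version"].apply(good_or_bad)

# Attach the label of its MR to every row (hypothetical column name)
df["status"] = df["MR"].map(labels)

# Unique MR names per class
good_mrs = labels[labels == "good"].index.tolist()
bad_mrs = labels[labels == "bad"].index.tolist()

print(good_mrs)  # ['GB5']
print(bad_mrs)   # ['GB1']
```

Unlike the loop in the question, these lists contain each MR once; filtering df on the status column instead would give one entry per row.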

Answered By - mozway

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
