Issue
I have a dataframe with columns as below:
                      Name  Measurement
0   Blue_Water_Final_Rev_0            3
1   Blue_Water_Final_Rev_1            4
2   Blue_Water_Final_Rev_2            5
3    Red_Water_Final_Rev_0            7
4  Red_Water_Initial_Rev_0            6
For each name, I want to keep only the row with the latest revision, and to keep the "Final" row when the other row is "Initial". For the data above, the expected output is:
                     Name  Measurement
2  Blue_Water_Final_Rev_2            5
3   Red_Water_Final_Rev_0            7
How can I do this in python in my pandas dataframe? Thanks.
Solution
You can extract the part of the name before "Final" and use drop_duplicates with keep='last':
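# Extract the part of each name before "_Final" (NaN for rows without "_Final")
keep = (df['Name']
        .str.extract(r'^(.*)_Final', expand=False)
        .drop_duplicates(keep='last')  # keep the last row for each extracted name
        .dropna()                      # discard rows that had no "_Final"
       )
# Select the surviving rows from the original dataframe by index label
out = df.loc[keep.index]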
NB. This assumes the rows are already sorted by revision, so that the last row for each name is the latest revision.
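If the rows are not sorted by revision, one possible variant is to parse the revision number and pick the highest one per name. This is only a sketch, not part of the original answer: it assumes every name of interest follows the pattern `<prefix>_Final_Rev_<number>`, and the shuffled sample data and the group names `prefix` and `rev` are made up for illustration.

```python
import pandas as pd

# Same rows as in the question, deliberately shuffled (illustrative data).
df = pd.DataFrame({
    "Name": [
        "Blue_Water_Final_Rev_2",
        "Red_Water_Initial_Rev_0",
        "Blue_Water_Final_Rev_0",
        "Red_Water_Final_Rev_0",
        "Blue_Water_Final_Rev_1",
    ],
    "Measurement": [5, 6, 3, 7, 4],
})

# Split each "Final" name into its prefix and revision number.
# Rows that do not match the pattern (e.g. "Initial") become NaN and are dropped.
parts = (df["Name"]
         .str.extract(r"^(?P<prefix>.*)_Final_Rev_(?P<rev>\d+)$")
         .dropna()
         .assign(rev=lambda d: d["rev"].astype(int)))

# For each prefix, get the index label of the row with the highest revision.
latest = parts.groupby("prefix")["rev"].idxmax()

# Select those rows, restoring the original row order.
out = df.loc[latest.sort_values()]
print(out)
```

Comparing the revisions as integers rather than strings means that Rev_10 is ranked above Rev_9. As with the approach above, a name that only has "Initial" rows is dropped entirely.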
Output:
Name Measurement
2 Blue_Water_Final_Rev_2 5
3 Red_Water_Final_Rev_0 7
If the last revision appears in several rows and you want to keep all of them:
out = df[df['Name'].isin(df.loc[keep.index, 'Name'])]
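To see the difference between the two selections, here is a small self-contained sketch. The data is hypothetical (it is not from the question): the latest Blue_Water revision is repeated on purpose, and the variable names `single` and `all_latest` are only for illustration.

```python
import pandas as pd

# Hypothetical data: the latest Blue_Water revision (Rev_1) appears twice.
df = pd.DataFrame({
    "Name": [
        "Blue_Water_Final_Rev_0",
        "Blue_Water_Final_Rev_1",
        "Blue_Water_Final_Rev_1",
        "Red_Water_Final_Rev_0",
        "Red_Water_Initial_Rev_0",
    ],
    "Measurement": [3, 4, 5, 7, 6],
})

# Same steps as in the answer: last row per name prefix, "Initial" rows removed.
keep = (df["Name"]
        .str.extract(r"^(.*)_Final", expand=False)
        .drop_duplicates(keep="last")
        .dropna())

# One row per name prefix (the last occurrence only).
single = df.loc[keep.index]

# Every row whose full name matches one of the kept names.
all_latest = df[df["Name"].isin(df.loc[keep.index, "Name"])]

print(single)
print(all_latest)
```

Here `single` contains one Blue_Water row, while `all_latest` contains both rows named Blue_Water_Final_Rev_1.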
Answered By - mozway