Issue
I have a database as the following:
And I would like to obtain a pandas dataframe filtered for the 2 rows per date, based on the top ones that have the highest population. The output should look like this:
I know that pandas offers a formula called nlargest: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
but I don't think it is usable for this use case. Is there any workaround?
Thanks so much in advance!
Solution
I have mimicked your dataframe as below and provided a way forward to get the desired result.
Your Dataframe:
>>> df
Date country population
0 2019-12-31 A 100
1 2019-12-31 B 10
2 2019-12-31 C 1000
3 2020-01-01 A 200
4 2020-01-01 B 20
5 2020-01-01 C 3500
6 2020-01-01 D 12
7 2020-02-01 D 2000
8 2020-02-01 E 54
Your Desired Solution:
You can use nlargest
method along with set_index
ans groupby
method.
This is what you will get..
>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date country
2019-12-31 C 1000
A 100
2020-01-01 C 3500
A 200
2020-02-01 D 2000
E 54
Name: population, dtype: int64
Now, as you want the DataFrame into original state by resetting the index of the DataFrame, which will give you following ..
>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
Another way around:
With groupby
and apply
function use reset_index
with parameter drop=True
and level=
..
>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
# df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
Answered By - Karn Kumar
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.