Friday, May 27, 2022

[FIXED] How to create a ranking variable/function for different periods in a panel data?

May 27, 2022 pandas, python, ranking-functions No comments

Issue

I have a dataset, df, that looks like this:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA
2020-01	11001	Los Angeles	CA	5000000	100000	0.02
2020-02	11001	Los Angeles	CA	5000000	125000	0.025
2020-03	11001	Los Angeles	CA	5000000	135000	0.027
2020-01	12002	Houston	TX	3000000	150000	0.05
2020-02	12002	Houston	TX	3000000	100000	0.033
2020-03	12002	Houston	TX	3000000	200000	0.066
...	...	...	...	...	...	...
2021-07	11001	Los Angeles	CA	5500499	340000	0.062
2021-07	12002	Houston	TX	3250012	211000	0.065

WhereQTDPERCAPITA is simply Quantity/Population. I have multiple cities (4149 to be more precise).

The quantities change according to every month, and so does the population.

I would like to create a new variable that serve as a ranking, ranging from [0,1], where 0 is the city with the lowest QTDPERCAPITA in that month, and 1 is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA	RANKING
2020-01	11001	Los Angeles	CA	5000000	100000	0.02	0
2020-02	11001	Los Angeles	CA	5000000	125000	0.025	0
2020-03	11001	Los Angeles	CA	5000000	135000	0.027	0
2020-01	12002	Houston	TX	3000000	150000	0.05	1
2020-02	12002	Houston	TX	3000000	100000	0.033	1
2020-03	12002	Houston	TX	3000000	200000	0.066	1
...	...	...	...	...	...	...	...
2021-07	11001	Los Angeles	CA	5500499	340000	0.062	0
2021-07	12002	Houston	TX	3250012	211000	0.065	1

How can I create this column such that the RANKING changes every month? I was thinking of a for loop that extracts the QTDPERCAPITA for every city on every unique date, and creates a new column, df['RANKING'] with the same date and city.

Solution

You can use:

# MinMax scaler: (rank - min) / (max - min)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)

# Rank between [0, 1] -> 0 the lowest, 1 the highest
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)

# Rank between [1, 4149] -> 1 the lowest, 4149 the highest
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')

Output:

Date	Code	City	State	Population	Quantity	QTDPERCAPITA	RANKING
2020-01	11001	Los Angeles	CA	5000000	100000	0.02	0
2020-02	11001	Los Angeles	CA	5000000	125000	0.025	0
2020-03	11001	Los Angeles	CA	5000000	135000	0.027	0
2020-01	12002	Houston	TX	3000000	150000	0.05	1
2020-02	12002	Houston	TX	3000000	100000	0.033	1
2020-03	12002	Houston	TX	3000000	200000	0.066	1
2021-07	11001	Los Angeles	CA	5500499	340000	0.618	1
2021-07	12002	Houston	TX	3250012	211000	0.065	0

Answered By - Corralien

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, May 27, 2022

[FIXED] How to create a ranking variable/function for different periods in a panel data?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels