Issue
I have a DataFrame whose first few rows look like this:
   white_rating   opening_short_name
0          1500         Slav Defense
1          1322  Nimzowitsch Defense
2          1496     King's Pawn Game
3          1439    Queen's Pawn Game
4          1523     Philidor Defense
...
and so on. The full DataFrame can be found here (3 columns × 20,000 rows). Now, I want to know how the frequency of each opening changes as rating increases from around 700 to around 2700.
Since there are a lot of openings, I created another DataFrame of the top 10 openings, and I was trying to build one DataFrame per opening giving the frequency at which it was played in each rating interval, such as 700-800, 1500-1600, and so on.
opening = data["opening_short_name"].value_counts().nlargest(10)  # top 10 openings
SG = data[data['opening_short_name'] == 'Scotch Game'][['white_rating']]  # Scotch Game data
A = pd.DataFrame(SG[SG['white_rating'] > 900])
Above is an attempt at slicing the data by rating, but it obviously doesn't get me the frequency. For 10 openings and 20 intervals from 700 to 2700, this approach amounts to 10 × 20 DataFrames before I get proper data!
This is where I am stuck. I do not understand how to extract the number of times each opening appears, with each cell holding the frequency at which it appears in each interval of 100 rating points. I will then divide these values by the total number of games played by players in that rating interval, to get a proper usage rate.
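Spelled out for a single cell, the computation I have in mind would look something like this (a hypothetical sketch with made-up names, just to illustrate the intent):
# Hypothetical single-cell computation: Scotch Game usage in the 900-1000 interval.
in_bin = data['white_rating'].between(901, 1000)  # games whose rating falls in this interval
sg_count = (data.loc[in_bin, 'opening_short_name'] == 'Scotch Game').sum()
use_factor = sg_count / in_bin.sum()  # divide by all games played in that interval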
The DataFrame I want will help me create a histogram or an area chart showing the distribution.
This is how it should ideally look, but with 10 opening columns and rows extending through all intervals up to 2700-2800. The rows below are named to make the idea clear; in practice, each could simply be named with the lower limit of its rating interval:
            Queen's Gambit  Italian Game  Sicilian Defense
700-800              0.005          0.05              0.01
800-900              0.02           0.2               0.03
900-1000             0.09           0.05              0.04
...
2700-2800            0.02           0.05              0.09
I think some combination of groupby(), sum(), and iloc can achieve this, but I can't figure it out. I am new to pandas, so any help with the exact syntax would be appreciated. I have tried reading the pandas documentation, but to no avail.
Solution
You can use groupby to apply value_counts after binning the data with pandas.cut:
df = pd.read_csv('RO - RD.csv', index_col=0)

(df['opening_short_name']
 .groupby(pd.cut(df['white_rating'],               # bin ratings into 100-point intervals
                 range(700, 3000, 100)))
 .apply(lambda s: s.value_counts(normalize=True))  # per-bin frequency of each opening
 .unstack(1, fill_value=0)                         # one column per opening, 0 where absent
)
(partial) output:
Alekhine Defense Alekhine Defense #2 Alekhine Defense #3 ...
white_rating
(700, 800] 0.000000 0.000000 0.000000
(800, 900] 0.018868 0.000000 0.000000
(900, 1000] 0.010582 0.000000 0.000000
(1000, 1100] 0.027273 0.000000 0.000000
(1100, 1200] 0.009825 0.000000 0.000000
(1200, 1300] 0.014055 0.000000 0.000000
(1300, 1400] 0.010160 0.000484 0.000000
(1400, 1500] 0.010386 0.000325 0.000325
(1500, 1600] 0.011136 0.000000 0.000000
...
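Since the question only asks for the top 10 openings, the result can be restricted to those columns afterwards. A minimal sketch, assuming the result above is saved as freq (freq, top10, and freq_top10 are illustrative names, not part of the original answer):
freq = (df['opening_short_name']
        .groupby(pd.cut(df['white_rating'],
                        range(700, 3000, 100)))
        .apply(lambda s: s.value_counts(normalize=True))
        .unstack(1, fill_value=0)
        )
top10 = df['opening_short_name'].value_counts().nlargest(10).index
freq_top10 = freq[top10]  # keep only the 10 most played openings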
Bonus: data visualization as a clustermap
import seaborn as sns
df2 = (df['opening_short_name']
       .groupby(pd.cut(df['white_rating'],
                       range(700, 3000, 100)))
       .apply(lambda s: s.value_counts(normalize=True))
       .unstack(1, fill_value=0)
       )
sns.clustermap(df2, cmap='Greys')
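Outside a Jupyter notebook, you may also need matplotlib to actually display the figure (standard matplotlib usage, not part of the original answer):
import matplotlib.pyplot as plt

plt.show()  # render the clustermap figure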
One can clearly see that beginners and top players play very differently.
Answered By - mozway