Issue
I have a DataFrame whose first few rows look like this:
   white_rating   opening_short_name
0          1500         Slav Defense
1          1322  Nimzowitsch Defense
2          1496     King's Pawn Game
3          1439    Queen's Pawn Game
4          1523     Philidor Defense
...
and so on. The full DataFrame can be found here (3 columns × 20,000 rows). Now, I want to know how the frequency of each opening changes as rating increases from around 700 to around 2700.
Since there are a lot of openings, I created another DataFrame of the top 10 openings, and I was trying to build one DataFrame per opening giving the frequency at which it was played in each rating interval, such as 700-800, 1500-1600, and so on.
opening = data["opening_short_name"].value_counts().nlargest(10)  # top 10 openings
SG = data[data['opening_short_name'] == 'Scotch Game'][['white_rating']]  # Scotch Game data
A = pd.DataFrame(SG[SG['white_rating'] > 900])
Above is an attempt at slicing the data by rating, but it obviously doesn't get me the frequency. For 10 openings and 20 intervals from 700 to 2700, this approach amounts to 10 × 20 DataFrames before I get proper data!
This is where I am stuck. I do not understand how to extract the number of times each opening appears, with each cell holding the frequency at which it appears in each interval of 100 rating points. I will then divide these values by the total number of games played by players in that rating interval, to get a proper usage rate.
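Spelled out for a single cell, the computation I have in mind would look something like this (a hypothetical sketch with made-up names, just to illustrate the intent):
# Hypothetical single-cell computation: Scotch Game usage in the 900-1000 interval.
in_bin = data['white_rating'].between(901, 1000)  # games whose rating falls in this interval
sg_count = (data.loc[in_bin, 'opening_short_name'] == 'Scotch Game').sum()
use_factor = sg_count / in_bin.sum()  # divide by all games played in that interval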
The DataFrame I want will help me create a histogram or an area chart showing the distribution.
This is how it should ideally look, but with 10 opening columns and rows extending through all intervals up to 2700-2800. The rows below are named to make the idea clear; in practice, each could simply be named with the lower limit of its rating interval:
            Queen's Gambit  Italian Game  Sicilian Defense
700-800              0.005          0.05              0.01
800-900              0.02           0.2               0.03
900-1000             0.09           0.05              0.04
...
2700-2800            0.02           0.05              0.09
I think some combination of groupby(), sum(), and iloc can achieve this, but I can't figure it out. I am new to pandas, so any help with the exact syntax would be appreciated. I have tried reading the pandas documentation, but to no avail.
Solution
You can use groupby to apply value_counts after binning the data with pandas.cut:
df = pd.read_csv('RO - RD.csv', index_col=0)

(df['opening_short_name']
 .groupby(pd.cut(df['white_rating'],               # bin ratings into 100-point intervals
                 range(700, 3000, 100)))
 .apply(lambda s: s.value_counts(normalize=True))  # per-bin frequency of each opening
 .unstack(1, fill_value=0)                         # one column per opening, 0 where absent
)
(partial) output:
Alekhine Defense Alekhine Defense #2 Alekhine Defense #3 ...
white_rating
(700, 800] 0.000000 0.000000 0.000000
(800, 900] 0.018868 0.000000 0.000000
(900, 1000] 0.010582 0.000000 0.000000
(1000, 1100] 0.027273 0.000000 0.000000
(1100, 1200] 0.009825 0.000000 0.000000
(1200, 1300] 0.014055 0.000000 0.000000
(1300, 1400] 0.010160 0.000484 0.000000
(1400, 1500] 0.010386 0.000325 0.000325
(1500, 1600] 0.011136 0.000000 0.000000
...
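Since the question only asks for the top 10 openings, the result can be restricted to those columns afterwards. A minimal sketch, assuming the result above is saved as freq (freq, top10, and freq_top10 are illustrative names, not part of the original answer):
freq = (df['opening_short_name']
        .groupby(pd.cut(df['white_rating'],
                        range(700, 3000, 100)))
        .apply(lambda s: s.value_counts(normalize=True))
        .unstack(1, fill_value=0)
        )
top10 = df['opening_short_name'].value_counts().nlargest(10).index
freq_top10 = freq[top10]  # keep only the 10 most played openings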
Bonus: data visualization as a clustermap
import seaborn as sns
df2 = (df['opening_short_name']
       .groupby(pd.cut(df['white_rating'],
                       range(700, 3000, 100)))
       .apply(lambda s: s.value_counts(normalize=True))
       .unstack(1, fill_value=0)
       )
sns.clustermap(df2, cmap='Greys')
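Outside a Jupyter notebook, you may also need matplotlib to actually display the figure (standard matplotlib usage, not part of the original answer):
import matplotlib.pyplot as plt

plt.show()  # render the clustermap figure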
One can clearly see that beginners and top players play very differently.
Answered By - mozway