Issue
I'm trying to create a new df from race_dbs
that's grouped by 'horse_id'
showing the number of times 'place' = 1
as well as the total number of times that 'horse_id'
occurs.
Some background on the dataset if it's helpful;
race_dbs
contains horse race data. There are 12 horses in a race, for each is shown their odds, fire, place, time, and gate number.
What I'm trying to achieve from this code is the calculation of win rates for each horse.
- A win is denoted by
'place' = 1
- Total race count will be calculated by how many times a particular 'horse_id' occurs in the db.
race_dbs
race_id | horse_id | odds | fire | place | horse_time | gate |
---|---|---|---|---|---|---|
V14qANzi | 398807 | NaN | 0 | 1 | 72.0191 | 7 |
xeieZak | 191424 | NaN | 0 | 8 | 131.3010 | 10 |
xeieZak | 139335 | NaN | 0 | 1 | 131.3713 | 9 |
xeieZak | 137195 | NaN | 0 | 11 | 131.6310 | 11 |
xeieZak | 398807 | NaN | 0 | 12 | 131.7886 | 2 |
... | ... | .. | .. | ... | ... | .. |
From this simple table the output would look like, but please bear in mind my dataset is very large, containing 12882353 rows in total.
desired output
horse_id | wins | races | win rate |
---|---|---|---|
398807 | 1 | 2 | 50% |
191424 | 0 | 1 | 0% |
139335 | 1 | 1 | 100% |
137195 | 0 | 1 | 0% |
... | .. | .. | ... |
It should be noted that I'm a complete coding beginner so forgive me if this is an easy solve.
I have tried to use the groupby and lambda pandas functions but I am struggling to combine both functions, and believe there will be a much simpler way.
import pandas as pd
race_db = pd.read_csv('horse_race_data_db.csv')
race_db_2 = pd.read_csv('2_horse_race_data.csv')
frames = [race_db, race_db_2]
race_dbs = pd.concat(frames, ignore_index=True, sort=False)
race_dbs_horse_wins = race_dbs.groupby('horse_id')['place'].apply(lambda x: x[x == 1].count())
race_dbs_horse_sums = race_dbs.groupby('horse_id').aggregate({"horse_id":['sum']})
Thanks for the help!
Solution
For count True
s values create helper boolean column and aggregate sum
, for win rate
aggregate mean
and for count use GroupBy.size
in named aggregations by GroupBy.agg
:
out = (race_dbs.assign(no1 = race_dbs['place'].eq(1))
.groupby('horse_id', sort=False, as_index=False)
.agg(**{'wins':('no1','sum'),
'races':('horse_id','size'),
'win rate':('no1','mean')}))
print (out)
horse_id wins races win rate
0 398807 1 2 0.5
1 191424 0 1 0.0
2 139335 1 1 1.0
3 137195 0 1 0.0
Answered By - jezrael
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.