Issue
So I'm trying to read in this dataset from Kaggle.
https://www.kaggle.com/gmadevs/atp-matches-dataset#atp_matches_2016.csv
I'm using pandas' read_csv function to do so, but it isn't splitting the columns properly. I've tried this code
df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv")
The printed out data frame gives me this though
tourney_id ... l_bpFaced
2016-M020 Brisbane Hard 32.0 A 20160104.0 300.0 105683.0 4.0 NaN Milos Raonic R 196.0 CAN 25.021218 14.0 2170.0 103819.0 1.0 NaN Roger Federer ... NaN
299.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0 8265.0 106233.0 8.0 NaN Dominic Thiem ... NaN
298.0 105683.0 4.0 NaN Milos Raonic R 196.0 CAN 25.021218 14.0 2170.0 106071.0 7.0 NaN Bernard Tomic ... NaN
297.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0 8265.0 105777.0 NaN NaN Grigor Dimitrov ... NaN
296.0 106233.0 8.0 NaN Dominic Thiem R NaN AUT 22.335387 20.0 1600.0 105227.0 3.0 NaN Marin Cilic ... NaN
Why is it having a problem splitting the columns?
I'm expecting an output of this, which is what I got for every year except 2016 and 2017 for some reason.
tourney_id tourney_name surface ... l_SvGms l_bpSaved l_bpFaced
0 2015-329 Tokyo Hard ... 10.0 2.0 5.0
1 2015-329 Tokyo Hard ... 13.0 12.0 19.0
2 2015-329 Tokyo Hard ... 18.0 9.0 11.0
3 2015-329 Tokyo Hard ... 13.0 4.0 8.0
4 2015-329 Tokyo Hard ... 10.0 1.0 5.0
The actual csv file looks to be in good shape and in a format identical to the other years. I also tried to specify the columns with the columns parameter in the read_csv function, but that gives me the same output.
Solution
The safest way I can think for is to read the csv twice:
rows = pd.read_csv('path/to/atp_matches_2016.csv', skiprows=[0], header = None)
# skip header line
rows = rows.dropna(axis=1, how='all')
# drop columns that only have NaNs
rows.columns = pd.read_csv('path/to/atp_matches_2016.csv', nrows=0).columns
print(rows.head(5))
Output:
tourney_id tourney_name surface draw_size tourney_level tourney_date \
0 2016-M020 Brisbane Hard 32.0 A 20160104.0
1 2016-M020 Brisbane Hard 32.0 A 20160104.0
2 2016-M020 Brisbane Hard 32.0 A 20160104.0
3 2016-M020 Brisbane Hard 32.0 A 20160104.0
4 2016-M020 Brisbane Hard 32.0 A 20160104.0
match_num winner_id winner_seed winner_entry ... w_bpFaced l_ace l_df \
0 300.0 105683.0 4.0 NaN ... 1.0 7.0 3.0
1 299.0 103819.0 1.0 NaN ... 1.0 2.0 4.0
2 298.0 105683.0 4.0 NaN ... 4.0 10.0 3.0
3 297.0 103819.0 1.0 NaN ... 1.0 8.0 2.0
4 296.0 106233.0 8.0 NaN ... 2.0 11.0 2.0
l_svpt l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced
0 61.0 34.0 25.0 14.0 10.0 3.0 5.0
1 55.0 31.0 18.0 9.0 8.0 2.0 6.0
2 84.0 54.0 41.0 16.0 12.0 2.0 2.0
3 104.0 62.0 46.0 21.0 16.0 8.0 11.0
4 98.0 52.0 41.0 27.0 15.0 7.0 8.0
Answered By - Chris
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.