Monday, June 27, 2022

[FIXED] Pandas read csv not reading a file properly. Not splitting into proper columns

Labels: csv, pandas, python

Issue

I'm trying to read in this dataset from Kaggle:

https://www.kaggle.com/gmadevs/atp-matches-dataset#atp_matches_2016.csv

I'm using pandas' read_csv function to do so, but it isn't splitting the columns properly. I've tried this code:

import pandas as pd
df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv")

Printing the data frame gives me this, though:

                                                                                                                                         tourney_id  ... l_bpFaced
2016-M020 Brisbane Hard 32.0 A 20160104.0 300.0 105683.0 4.0 NaN Milos Raonic  R 196.0 CAN 25.021218 14.0 2170.0 103819.0 1.0  NaN    Roger Federer  ...       NaN
                                          299.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0  8265.0 106233.0 8.0  NaN    Dominic Thiem  ...       NaN
                                          298.0 105683.0 4.0 NaN Milos Raonic  R 196.0 CAN 25.021218 14.0 2170.0 106071.0 7.0  NaN    Bernard Tomic  ...       NaN
                                          297.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0  8265.0 105777.0 NaN  NaN  Grigor Dimitrov  ...       NaN
                                          296.0 106233.0 8.0 NaN Dominic Thiem R NaN   AUT 22.335387 20.0 1600.0 105227.0 3.0  NaN      Marin Cilic  ...       NaN

Why is it having a problem splitting the columns?

I'm expecting output like this, which is what I get for every year except 2016 and 2017:

  tourney_id tourney_name surface  ...  l_SvGms l_bpSaved  l_bpFaced
0   2015-329        Tokyo    Hard  ...     10.0       2.0        5.0
1   2015-329        Tokyo    Hard  ...     13.0      12.0       19.0
2   2015-329        Tokyo    Hard  ...     18.0       9.0       11.0
3   2015-329        Tokyo    Hard  ...     13.0       4.0        8.0
4   2015-329        Tokyo    Hard  ...     10.0       1.0        5.0

The actual csv file looks to be in good shape and in a format identical to the other years. I also tried to specify the columns with the columns parameter in the read_csv function, but that gives me the same output.

Solution

The safest way I can think of is to read the CSV twice, once for the data rows and once for the header:

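import pandas as pd

# Read only the data rows: skip the header line and let pandas number the columns
rows = pd.read_csv('path/to/atp_matches_2016.csv', skiprows=[0], header=None)
# Drop columns that contain only NaNs
rows = rows.dropna(axis=1, how='all')

# Read only the header line and reuse its column names
rows.columns = pd.read_csv('path/to/atp_matches_2016.csv', nrows=0).columns
print(rows.head(5))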

Output:

  tourney_id tourney_name surface  draw_size tourney_level  tourney_date  \
0  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
1  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
2  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
3  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
4  2016-M020     Brisbane    Hard       32.0             A    20160104.0 



   match_num  winner_id  winner_seed winner_entry  ... w_bpFaced l_ace  l_df  \
0      300.0   105683.0          4.0          NaN  ...       1.0   7.0   3.0   
1      299.0   103819.0          1.0          NaN  ...       1.0   2.0   4.0   
2      298.0   105683.0          4.0          NaN  ...       4.0  10.0   3.0   
3      297.0   103819.0          1.0          NaN  ...       1.0   8.0   2.0   
4      296.0   106233.0          8.0          NaN  ...       2.0  11.0   2.0   

  l_svpt  l_1stIn  l_1stWon  l_2ndWon  l_SvGms  l_bpSaved l_bpFaced  
0   61.0     34.0      25.0      14.0     10.0        3.0       5.0  
1   55.0     31.0      18.0       9.0      8.0        2.0       6.0  
2   84.0     54.0      41.0      16.0     12.0        2.0       2.0  
3  104.0     62.0      46.0      21.0     16.0        8.0      11.0  
4   98.0     52.0      41.0      27.0     15.0        7.0       8.0
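
The shifted output in the question is what pandas typically produces when the data rows contain more fields than the header row, for example because each row ends with an extra delimiter. In that case the surplus leading columns are used as the index. The sketch below reproduces this on a small in-memory sample and shows a possible one-call alternative, index_col=False. The sample text is a hypothetical stand-in, not the real Kaggle file, and it assumes exactly one trailing delimiter per row, so the real file may behave differently.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle file: three header fields, but every
# data row ends with an extra delimiter and therefore has four fields.
sample = (
    "tourney_id,tourney_name,surface\n"
    "2016-M020,Brisbane,Hard,\n"
    "2016-M020,Brisbane,Hard,\n"
)

# Count the fields on each line to expose the header/data mismatch.
field_counts = [len(fields) for fields in csv.reader(io.StringIO(sample))]
print(field_counts)  # [3, 4, 4]

# Default behaviour: the surplus leading column becomes the index,
# so every value ends up one column to the left of its header.
shifted = pd.read_csv(io.StringIO(sample))
print(shifted)

# index_col=False tells pandas not to use any column as the index,
# which keeps the values aligned with the header names.
df = pd.read_csv(io.StringIO(sample), index_col=False)
print(df)
```

If the real rows carry more than one surplus field, pandas may warn that data beyond the header width is being dropped. The two-pass approach above makes that step explicit by dropping only the columns that are entirely NaN.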

Answered By - Chris

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
