Issue
I have a CSV file exported from Excel whose columns are str (a date), str, float, float, float, float, int. Some of the Excel cells are empty, so keep_default_na is needed. Some values are double-quoted and contain thousands separators. The combination of parameters seems to confuse the pandas parser, because running the code below raises:
ValueError: could not convert string to float: '1,917.6'
It seems that when keep_default_na is present, thousands gets ignored. When I run this on a CSV without any ,,,,,, lines, it works perfectly.
CSV FILE:
TS
Date,Symbol,Open,High,Low,Close,Volume
6/14/2022 23:59,A,918.1,918.1,918.1,918.1,1
,,,,,,
6/14/2022 23:57,A,"1,917.6",917.6,917.6,917.6,1
,,,,,,
,,,,,,
CODE:
import numpy as np
import pandas as pd

df = pd.read_csv('test.csv',
                 skiprows=1,
                 quotechar='"',
                 thousands=',',
                 keep_default_na=False,
                 dtype={'Open': np.float64, 'High': np.float64,
                        'Low': np.float64, 'Close': np.float64,
                        'Volume': np.uint32, 'Symbol': 'string'})
Solution
I think the problem is that the routine which reports the offending value isn't as sophisticated as the full parsing engine, so pandas is pointing you at the wrong value: you actually have multiple dtype issues in each column. Float columns can't hold the value "" (so it needs to be included in na_values), and int columns can't hold NA values. You therefore either need to skip the footer, or provide na_values and use float dtypes. Both of the following work for me:
skipfooter works with your arguments, but note that it forces a fallback to the slower python engine (the C engine doesn't support it):
In [26]: df = pd.read_csv('test.csv',
...: skiprows=1,
...: quotechar='"',
...: thousands=",",
...: keep_default_na=False,
...: dtype = {'Open': np.float64, 'High': np.float64,
...: 'Low': np.float64, 'Close': np.float64,
...: 'Volume': np.uint32, 'Symbol': 'string'},
...: skipfooter=2,
...: )
Alternatively, you could read with float columns and specify [""] as a na_value:
In [29]: df = pd.read_csv('test.csv',
...: skiprows=1,
...: quotechar='"',
...: thousands=",",
...: keep_default_na=False,
...: dtype = {'Open': np.float64, 'High': np.float64,
...: 'Low': np.float64, 'Close': np.float64,
...: 'Volume': np.float32, 'Symbol': 'string'},
...: na_values=[""],
...: )
At this point you can drop rows based on any condition you like; for example, drop the rows with NaN in 'Volume', then convert the column to int:
In [30]: df = df[df.Volume.notnull()]
In [31]: df["Volume"] = df["Volume"].astype(int)
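Putting the second approach together, here is a self-contained sketch. It feeds the sample CSV from the question in via io.StringIO instead of a test.csv file, but otherwise uses the same arguments:

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the sample CSV from the question ("TS" is the
# title row that skiprows=1 discards).
csv_text = '''TS
Date,Symbol,Open,High,Low,Close,Volume
6/14/2022 23:59,A,918.1,918.1,918.1,918.1,1
,,,,,,
6/14/2022 23:57,A,"1,917.6",917.6,917.6,917.6,1
,,,,,,
,,,,,,
'''

df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,
    quotechar='"',
    thousands=",",
    keep_default_na=False,
    na_values=[""],  # only the empty string is treated as NA
    dtype={"Open": np.float64, "High": np.float64,
           "Low": np.float64, "Close": np.float64,
           "Volume": np.float32, "Symbol": "string"},
)

# Drop the all-empty rows, then restore an integer Volume column.
df = df[df.Volume.notnull()]
df["Volume"] = df["Volume"].astype(int)
```

The quoted value "1,917.6" parses to the float 1917.6, and the three ,,,,,, rows are dropped, leaving two data rows.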
Alternatively, you could fill the NAs with an invalid-value marker, e.g.:
In [32]: df["Volume"] = df["Volume"].fillna(-9999).astype(int)
Answered By - Michael Delgado