Issue
I use the same piece of code to import multiple dataframes. Usually they have the same column names with different data. However, sometimes the column names have a different number of spaces before or after them.
df = pd.read_csv(
    file_path,
    delimiter="|",
    low_memory=True,
    dtype=schema,
    usecols=schema.keys(),
)
The schema of the file is in a different file:
file_schema = {
    " Age ": str,
    " Name ": str,
    " Country ": str,
}
In some other cases, there are no spaces before or after the names:
file_schema = {
    "Age": str,
    "Name": str,
    "Country": str,
}
Currently, with a single schema, if the spaces around the column names in the file do not match the schema, I get errors related to usecols.
Is there a way to write the column names once, in a single schema file, so that it works no matter how many spaces appear before or after the names?
Solution
I think it should be possible to match the column names with
pd.read_csv(..., usecols=lambda x: x.strip() in schema.keys())
and then either strip them afterwards with
df.columns = df.columns.str.strip()
or even better try to pass them explicitly with
pd.read_csv(..., header=0, names=schema.keys())
if you know that all columns declared in schema will be in the file and in order.
I'm not sure, though, whether dtype=schema will then immediately cause the next problem.
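To sidestep the dtype question, one option is to read only the header first and then translate the stripped schema names back to the raw names found in the file. The following is a minimal sketch, not a tested recipe for your data: the function name read_with_stripped_columns is hypothetical, and it assumes the input is a file path (so it can be read twice) and that the keys of dtype and usecols have to match the header exactly as it appears in the file.

```python
import pandas as pd

# Schema written once, without any surrounding spaces.
schema = {"Age": str, "Name": str, "Country": str}


def read_with_stripped_columns(file_path, schema, delimiter="|"):
    """Hypothetical helper: read a CSV whose header names may be padded with spaces."""
    # First pass: read only the header row to see the raw column names.
    raw_names = pd.read_csv(file_path, delimiter=delimiter, nrows=0).columns

    # Map each raw name (spaces included) to the dtype of its stripped name.
    raw_dtypes = {
        raw: schema[raw.strip()] for raw in raw_names if raw.strip() in schema
    }

    # Second pass: usecols and dtype now use the names exactly as in the file.
    df = pd.read_csv(
        file_path,
        delimiter=delimiter,
        usecols=list(raw_dtypes),
        dtype=raw_dtypes,
    )

    # Normalise the column names so downstream code sees the schema names.
    df.columns = df.columns.str.strip()
    return df
```

Calling read_with_stripped_columns(file_path, schema) should then give the same column names whether the file header reads " Age | Name | Country " or "Age|Name|Country". Columns that are in the file but not in the schema are dropped, as with the original usecols.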
Answered By - maow