Issue
I'm loading a dataset from a .csv file that includes special characters such as €, ă or ș.
Normally they should load fine with UTF-8 encoding, but when I display them in a Jupyter notebook none of these characters are rendered properly.
Code I use to load the .csv file:
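import pandas as pd

inter_df = pd.read_csv(
    f,                      # path or file object of the .csv file
    header=0,
    sep='|',
    encoding='utf-8',
    engine='python',
    error_bad_lines=False   # deprecated since pandas 1.3, removed in 2.0; newer versions use on_bad_lines='skip'
)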
Can anybody suggest a solution on how to handle these special characters?
Solution
What you see is a UTF-8 character being interpreted as Windows-1252. In UTF-8, € is encoded as the three bytes E2 82 AC, which Windows-1252 reads as the three characters â, ‚ and ¬ (displayed as â‚¬). I don't think this is a pandas problem: your file was decoded with the UTF-8 decoder, and if the bytes were not valid UTF-8 there would have been an error. Since you mention that you load it in a Jupyter notebook, the encoding used for display is decided by your browser. Usually Jupyter sends a Content-Type header that specifies the charset as UTF-8. However, if Jupyter or the browser is too old, this attribute may not be honored (as far as I know, IE 11 ignores it unless its encoding setting is set to auto). In that case the browser tries to interpret these characters as Windows-1252.
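The byte-level effect described above can be reproduced with Python's built-in codecs. This is only a minimal sketch: the helper names to_mojibake and looks_like_mojibake are made up for illustration, and the second one is a rough heuristic, not a reliable detector.

```python
def to_mojibake(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def looks_like_mojibake(text: str) -> bool:
    """Heuristic (an assumption of this sketch): True if the text can be
    re-encoded as Windows-1252 and the resulting bytes decode as UTF-8
    to a different string, i.e. the text was probably mis-decoded."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # the text is most likely already correct.
        return False
    return repaired != text


euro_bytes = "€".encode("utf-8")   # b'\xe2\x82\xac', the E2 82 AC from the answer
garbled = to_mojibake("€")         # 'â‚¬', what a Windows-1252 reader shows
```

If looks_like_mojibake returns False for a value taken from inter_df, that suggests the DataFrame holds the correct text and the garbling happens on the display side, as the answer argues. If it returns True, the file itself probably contains text that was mis-decoded before pandas ever read it.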
Answered By - whilrun