Issue
I used Scrapy on a Linux machine to crawl some websites and saved the results to a CSV. When I retrieved the dataset and viewed it on a Windows machine, I saw the characters ï»¿. Here is what I did to re-encode them to UTF-8-SIG:
import pandas as pd
my_data = pd.read_csv("./dataset/my_data.csv")
output = "./dataset/my_data_converted.csv"
my_data.to_csv(output, encoding='utf-8-sig', index=False)
So now they show up as a stray character at the beginning when viewed in VSCode. But if I view the file in Notepad++, I don't see it. How do I actually remove them all?
Solution
Given your comment, I suppose that you ended up with two BOMs.
Let's look at a small example.
I'm using the built-in open instead of pd.read_csv/pd.to_csv, but the meaning of the encoding parameter is the same.
Let's create a file saved as UTF-8 with a BOM:
>>> text = 'foo'
>>> with open('/tmp/foo', 'w', encoding='utf-8-sig') as f:
...     f.write(text)
... 
3
Now let's read it back in, but with a different encoding: "utf-8" instead of "utf-8-sig". In your case, you didn't specify the encoding parameter at all, but the default is most probably "utf-8" or "cp1252", neither of which strips the BOM. If you're unsure, you can ask Python what your platform's default is; this is the value that open() falls back on when no encoding is given:
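>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
(That's the result on a Linux machine; on Windows it typically prints 'cp1252'.) So the following is more or less equivalent to your code snippet: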
>>> with open('/tmp/foo', 'r', encoding='utf8') as f:
...     text = f.read()
... 
>>> text
'\ufefffoo'
>>> with open('/tmp/foo_converted', 'w', encoding='utf-8-sig') as f:
...     f.write(text)
... 
4
The BOM is read as part of the text; it's the first character (here represented as "\ufeff"). Note that f.write returned 4 this time: the string being written is '\ufefffoo', four characters.
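As an aside, you can ask Python what that character is; unicodedata is part of the standard library (U+FEFF's official name reflects its older, now-deprecated use):
>>> import unicodedata
>>> unicodedata.name('\ufeff')
'ZERO WIDTH NO-BREAK SPACE'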
Let's see what's actually in the files, using a suitable command-line tool:
$ hexdump -C /tmp/foo
00000000  ef bb bf 66 6f 6f                                |...foo|
00000006
$ hexdump -C /tmp/foo_converted
00000000  ef bb bf ef bb bf 66 6f 6f                       |......foo|
00000009
In UTF-8, the BOM is encoded as the three bytes EF BB BF.
Clearly, the second file has two of them.
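If you don't have hexdump at hand, here is a quick equivalent check from Python (a sketch; codecs.BOM_UTF8 is the standard library's constant for those three bytes):
>>> import codecs
>>> codecs.BOM_UTF8
b'\xef\xbb\xbf'
>>> with open('/tmp/foo_converted', 'rb') as f:
...     raw = f.read()
... 
>>> raw
b'\xef\xbb\xbf\xef\xbb\xbffoo'
>>> raw.count(codecs.BOM_UTF8)
2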
So even a BOM-aware program will see a nonsense character at the beginning of foo_converted, since the BOM is only stripped once.
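A minimal way to clean this up, then, sticking with the built-in open (the file names are just placeholders): decode with "utf-8-sig", which strips one BOM, drop any leftover "\ufeff" characters yourself, and write the result once.
>>> with open('/tmp/foo_converted', 'r', encoding='utf-8-sig') as f:
...     text = f.read().lstrip('\ufeff')  # utf-8-sig removed one BOM; lstrip drops any extras
... 
>>> text
'foo'
>>> with open('/tmp/foo_fixed', 'w', encoding='utf-8-sig') as f:
...     f.write(text)  # the codec writes exactly one BOM at the start
... 
3
With pandas, the same idea should work: pass encoding='utf-8-sig' to pd.read_csv as well, clean a possible leftover '\ufeff' from the first column name (e.g. with df.columns.str.replace('\ufeff', '')), and only then call to_csv.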
Answered By - lenz