Issue
I used Scrapy on a Linux machine to crawl some websites and saved the results to a CSV. When I retrieved the dataset and viewed it on a Windows machine, I saw the characters ï»¿. Here is what I did to re-encode them to UTF-8-SIG:
import pandas as pd
my_data = pd.read_csv("./dataset/my_data.csv")
output = "./dataset/my_data_converted.csv"
my_data.to_csv(output, encoding='utf-8-sig', index=False)
So now they show up as a stray character at the beginning when viewed in VSCode. But if I view the file in Notepad++, I don't see it. How do I actually remove them all?
Solution
Given your comment, I suppose that you ended up with two BOMs.
Let's look at a small example.
I'm using the built-in open instead of pd.read_csv/pd.to_csv, but the meaning of the encoding parameter is the same.
Let's create a file saved as UTF-8 with a BOM:
>>> text = 'foo'
>>> with open('/tmp/foo', 'w', encoding='utf-8-sig') as f:
...     f.write(text)
... 
3
Now let's read it back in, but with a different encoding: "utf-8" instead of "utf-8-sig". In your case, you didn't specify the encoding parameter at all, but the default is most probably "utf-8" or "cp1252", neither of which strips the BOM. If you're unsure, you can ask Python what your platform's default is; this is the value that open() falls back on when no encoding is given:
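>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
(That's the result on a Linux machine; on Windows it typically prints 'cp1252'.) So the following is more or less equivalent to your code snippet: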
>>> with open('/tmp/foo', 'r', encoding='utf8') as f:
...     text = f.read()
... 
>>> text
'\ufefffoo'
>>> with open('/tmp/foo_converted', 'w', encoding='utf-8-sig') as f:
...     f.write(text)
... 
4
The BOM is read as part of the text; it's the first character (here represented as "\ufeff"). Note that f.write returned 4 this time: the string being written is '\ufefffoo', four characters.
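As an aside, you can ask Python what that character is; unicodedata is part of the standard library (U+FEFF's official name reflects its older, now-deprecated use):
>>> import unicodedata
>>> unicodedata.name('\ufeff')
'ZERO WIDTH NO-BREAK SPACE'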
Let's see what's actually in the files, using a suitable command-line tool:
$ hexdump -C /tmp/foo
00000000  ef bb bf 66 6f 6f                                |...foo|
00000006
$ hexdump -C /tmp/foo_converted
00000000  ef bb bf ef bb bf 66 6f 6f                       |......foo|
00000009
In UTF-8, the BOM is encoded as the three bytes EF BB BF.
Clearly, the second file has two of them.
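If you don't have hexdump at hand, here is a quick equivalent check from Python (a sketch; codecs.BOM_UTF8 is the standard library's constant for those three bytes):
>>> import codecs
>>> codecs.BOM_UTF8
b'\xef\xbb\xbf'
>>> with open('/tmp/foo_converted', 'rb') as f:
...     raw = f.read()
... 
>>> raw
b'\xef\xbb\xbf\xef\xbb\xbffoo'
>>> raw.count(codecs.BOM_UTF8)
2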
So even a BOM-aware program will see a nonsense character at the beginning of foo_converted, since the BOM is only stripped once.
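A minimal way to clean this up, then, sticking with the built-in open (the file names are just placeholders): decode with "utf-8-sig", which strips one BOM, drop any leftover "\ufeff" characters yourself, and write the result once.
>>> with open('/tmp/foo_converted', 'r', encoding='utf-8-sig') as f:
...     text = f.read().lstrip('\ufeff')  # utf-8-sig removed one BOM; lstrip drops any extras
... 
>>> text
'foo'
>>> with open('/tmp/foo_fixed', 'w', encoding='utf-8-sig') as f:
...     f.write(text)  # the codec writes exactly one BOM at the start
... 
3
With pandas, the same idea should work: pass encoding='utf-8-sig' to pd.read_csv as well, clean a possible leftover '\ufeff' from the first column name (e.g. with df.columns.str.replace('\ufeff', '')), and only then call to_csv.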
Answered By - lenz