Issue
I have fetched data from a website using the BeautifulSoup module. I know from the meta header that the source encoding for this document is 'iso-8859-1'. I also know that BeautifulSoup automatically transcodes to 'UTF-8' upon creation of the BeautifulSoup object.
import requests
from bs4 import BeautifulSoup
url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r=requests.get(url)
soup_data=BeautifulSoup(r.content, 'lxml')
print(soup_data.prettify())
Unfortunately, the website has a duplicate meta element.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Upon inspection of the BeautifulSoup object using prettify, I realized that BeautifulSoup converted only one of these meta tags.
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
I'm therefore confused about what the actual encoding of my BeautifulSoup object is.
Also, during data processing I realized that some of the text elements of this object are not displayed properly in my PyCharm console. These strings contain 'iso-8859-1' character codes. Therefore, I suspect that the object is either still in ISO encoding or, even worse, somehow mixed up.
['\xa0\xa0\xa0\xa0M. le président.' '\xa0\xa0\xa0\xa0M. le président.'
I saw these ISO characters for the first time after I ran a numpy function.
series = np.apply_along_axis(lambda x: x[0].get_text(), 0, [df])
Any suggestions on how to proceed from this situation? I would like to convert the object to UTF-8 (and be 100% sure it's fully in UTF-8).
Solution
BeautifulSoup used the ISO-8859-1 encoding to decode r.content (a bytes object) into Unicode (a str object). A str is not encoded at all. It is made of Unicode code points.
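The bytes-versus-str distinction can be seen without any network access. This is a minimal sketch using a made-up sample word rather than the real page content:

```python
# Illustrative sample only: one word encoded the way the page's meta tag claims.
raw = "président".encode("iso-8859-1")
print(type(raw), raw)        # a bytes object; the é is the single byte 0xE9

# Decoding turns the bytes into a str made of Unicode code points.
text = raw.decode("iso-8859-1")
print(type(text), text)

# A str carries no encoding. An encoding is chosen only when converting
# back to bytes, for example when writing to a file or a socket.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)            # here the é becomes the two bytes 0xC3 0xA9
```

So once BeautifulSoup has built its tree, the text it returns is already Unicode; the only open question is whether the right codec was used for the decoding step.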
It turns out the data wasn't encoded in ISO-8859-1. It was encoded in Windows-1252, a similar encoding that maps most of the bytes in the 0x80-0x9F range to printable characters, where ISO-8859-1 maps them to control codes.
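The difference between the two codecs can be reproduced offline. The byte string below is a hand-made sample (an assumption for illustration, not bytes copied from the page) containing 0x85 and 0x92, decoded once with each codec:

```python
# Hand-made sample: 0x85 and 0x92 are the bytes that differ between the codecs.
raw = b"censure\x85 d\x92accessibilit\xe9"

as_latin1 = raw.decode("iso-8859-1")    # 0x85 -> U+0085, 0x92 -> U+0092 (control codes)
as_cp1252 = raw.decode("windows-1252")  # 0x85 -> U+2026 (…), 0x92 -> U+2019 (’)

# repr() makes the unprintable control codes visible as escapes.
print(repr(as_latin1))
print(repr(as_cp1252))
```

Both calls succeed, which is why a wrong ISO-8859-1 declaration goes unnoticed: every byte is valid in ISO-8859-1, and only the resulting characters are wrong.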
The requests response indicates the encoding the website used (r.encoding) and the apparent encoding according to its detection code (r.apparent_encoding). Here are some differences in the actual text I found:
import requests
from bs4 import BeautifulSoup
url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r=requests.get(url)
print(f'{r.encoding=}')
print(f'{r.apparent_encoding=}')
print()
soup_data=BeautifulSoup(r.content, 'lxml')
print(repr(soup_data.find('a',href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
print(repr(soup_data.find('a',href="#",accesskey="0").text))
print()
# Using the correct encoding
soup_data=BeautifulSoup(r.content, 'lxml', from_encoding='Windows-1252')
print(repr(soup_data.find('a',href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
print(repr(soup_data.find('a',href="#",accesskey="0").text))
Output. Note the \x85 and \x92 code points in "censure…" and "d’accessibilité" in the first instance. The … (U+2026) and ’ (U+2019) code points don't exist in ISO-8859-1, and the bytes 0x85 and 0x92 translate to U+0085 and U+0092 respectively, which are unprintable control codes. I've used repr() to show them as escape codes.
r.encoding='ISO-8859-1'
r.apparent_encoding='Windows-1252'
'Autres scrutins solennels (déclarations, motions de censure\x85)'
'Politique d\x92accessibilité'
'Autres scrutins solennels (déclarations, motions de censure…)'
'Politique d’accessibilité'
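As for the \xa0 escapes in the question's numpy output: U+00A0 is the no-break space, a legitimate character that both encodings share, and it appears as an escape only because the strings are shown in their repr() form. A minimal sketch, assuming the sample string from the question and assuming that replacing no-break spaces with ordinary spaces is acceptable for the data at hand:

```python
import unicodedata

# Sample string copied from the output shown in the question.
sample = "\xa0\xa0\xa0\xa0M. le président."

# The leading characters are ordinary Unicode no-break spaces.
print(unicodedata.name(sample[0]))

# repr() escapes them; printing the string itself shows them as spacing.
print(repr(sample))
print(sample)

# Optional cleanup (an assumption about what is wanted, not a required step).
cleaned = sample.replace("\xa0", " ").strip()

# UTF-8 only comes into play when the str is encoded to bytes.
utf8_bytes = cleaned.encode("utf-8")
print(utf8_bytes)
```

When saving the text, passing encoding="utf-8" to open() is what makes the bytes on disk UTF-8.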
Answered By - Mark Tolonen