Issue
I am trying to get the text out of certain class in a HTML using beautiful soup. I have successfully got the texts but, there are some anomalies(unrecognisable characters) in it like shown in the image below. How can I solve it with a python code instead of manually deleting these anomalies.
Code:
try:
html =requests.get(url)
except:
print("no conection")
try:
soup = BS(html.text,'html.parser')
except:
print("pasre error")
print(soup.find('div',{'class':'_3WlLe clearfix'}).get_text())
Solution
When you access html.text
, Requests tries to determine the character encoding so it can properly decode the raw bytes it received from the server. The content-type
header that timesofindia sent is text/html; charset=iso-8859-1
, which is what Requests went with. The character encoding is almost certainly utf-8
.
You can fix this by either setting the encoding
of html
to utf-8
before accessing html.text
:
try:
html =requests.get(url)
html.encoding = 'utf-8'
except:
print("no conection")
try:
soup = BS(html.text,'html.parser')
except:
print("pasre error")
print(soup.find('div',{'class':'_3WlLe clearfix'}).get_text())
or decode html.content
as utf-8
, and pass that into BS
instead of html.text
:
try:
html =requests.get(url)
except:
print("no conection")
try:
soup = BS(html.content.decode('utf-8'),'html.parser')
except:
print("pasre error")
print(soup.find('div',{'class':'_3WlLe clearfix'}).get_text())
I would highly recommend you learn about character encoding and Unicode. It is very easy to get tripped up by it. We've all been there.
Characters, Symbols and the Unicode Miracle - Computerphile by Tom Scott and Sean Riley
What every programmer absolutely, positively needs to know about encodings and character sets to work with text by David C. Zentgraf
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Answered By - GordonAitchJay
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.