Sunday, November 13, 2022

[FIXED] Beautiful Soup HTML parsing anomaly

Tags: html, python-3.x, web-scraping

Issue

I am trying to extract the text of an element with a certain class from an HTML page using Beautiful Soup. I get the text successfully, but it contains some anomalies (unrecognisable characters). How can I fix this in Python code instead of deleting these characters manually?

Code:

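    import requests
    from bs4 import BeautifulSoup as BS

    # url: address of the page being scraped (not shown in the question)
    try:
        html = requests.get(url)
    except requests.RequestException:
        print("no connection")
    try:
        soup = BS(html.text, 'html.parser')
    except Exception:
        print("parse error")
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())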

Solution

When you access html.text, Requests has to pick a character encoding so it can decode the raw bytes it received from the server. The Content-Type header that timesofindia sent is text/html; charset=iso-8859-1, so that is the encoding Requests used. The page's actual encoding is almost certainly utf-8, which is why the non-ASCII characters come out garbled.
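To see how this kind of mismatch produces stray characters, here is a minimal offline sketch. The sample string is made up for illustration and is not taken from the page in question. The sketch decodes the same UTF-8 bytes twice, once with each charset.

```python
# Illustrative sample text (an assumption, not from the scraped page).
original = "“Quoted” text – café"

# What a server would send: the text encoded as UTF-8 bytes.
raw = original.encode("utf-8")

# Decoding with the wrong charset never fails, because every byte maps to
# some ISO-8859-1 character, but each multi-byte character turns into
# several unrelated characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the right charset restores the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Because the wrong decode is lossless at the byte level, re-encoding `wrong` as iso-8859-1 and decoding it as utf-8 also recovers the original text, but it is cleaner to decode correctly in the first place.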

You can fix this by either setting the encoding of html to utf-8 before accessing html.text:

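    import requests
    from bs4 import BeautifulSoup as BS

    try:
        html = requests.get(url)
        # Override the charset Requests took from the Content-Type header
        html.encoding = 'utf-8'
    except requests.RequestException:
        print("no connection")
    try:
        soup = BS(html.text, 'html.parser')
    except Exception:
        print("parse error")
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())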

or decoding html.content as utf-8 and passing that into BS instead of html.text:

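    import requests
    from bs4 import BeautifulSoup as BS

    try:
        html = requests.get(url)
    except requests.RequestException:
        print("no connection")
    try:
        # Decode the raw bytes as UTF-8 ourselves instead of using html.text
        soup = BS(html.content.decode('utf-8'), 'html.parser')
    except Exception:
        print("parse error")
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())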
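If you do not know in advance whether a site's declared charset can be trusted, one possible approach is to try strict UTF-8 first and fall back to the declared charset only when that fails. This is a heuristic sketch, not something Requests or Beautiful Soup does for you. The helper name `decode_body` and its parameters are hypothetical, and the sample strings are made up for illustration.

```python
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw response bytes, preferring UTF-8 (hypothetical helper)."""
    try:
        # Strict decoding: bytes in another encoding that contain non-ASCII
        # characters are very unlikely to also be valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the charset the server declared (or ISO-8859-1 if
        # none was given) and replace anything that still cannot be decoded.
        return raw.decode(declared or "iso-8859-1", errors="replace")


# Made-up sample data: the same text in two different encodings.
utf8_bytes = "naïve café".encode("utf-8")
latin1_bytes = "naïve café".encode("iso-8859-1")

print(decode_body(utf8_bytes, "iso-8859-1"))    # declared charset is wrong, UTF-8 is used
print(decode_body(latin1_bytes, "iso-8859-1"))  # not valid UTF-8, falls back to declared
```

With the code above, this could be used as `BS(decode_body(html.content, html.encoding), 'html.parser')`.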

I would highly recommend you learn about character encoding and Unicode. It is very easy to get tripped up by it. We've all been there.

Characters, Symbols and the Unicode Miracle - Computerphile by Tom Scott and Sean Riley

What every programmer absolutely, positively needs to know about encodings and character sets to work with text by David C. Zentgraf

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Answered By - GordonAitchJay

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
