Thursday, January 20, 2022

[FIXED] Python Request not getting all data

Labels: beautifulsoup, python, python-requests

Issue

I'm trying to scrape data from Google Translate for educational purposes.

Here is the code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full

class Phonetizer:
    def __init__(self,sentence : str,language_ : str = 'en'):
        self.words=sentence.split()
        self.language=language_
    def get_phoname(self):
        for word in self.words:
            print(word)
            url="https://translate.google.com/#view=home&op=translate&sl="+self.language+"&tl="+self.language+"&text="+word
            print(url)
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            webpage = urlopen(req).read()
            f= open("debug.html","w+")
            f.write(webpage.decode("utf-8"))
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage,'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break

The problem is that the HTML I get back contains no element with the CSS classes tlid-transliteration-content transliteration-content full.

But using the browser's inspector, I found that the phonemes are inside an element with exactly these classes.

I saved the returned HTML to check: the classes are not present, and the page is incomplete compared with the Google Translate page shown in the browser. I have heard that Google blocks crawlers, bots, and spiders, and that its systems detect them easily, so I added the User-Agent header, but I still can't access the whole page.

How can I access the whole page and read all the data from the Google Translate page?

I have also tried the code below:

from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
lang = "en"
word = "hello"
url="https://translate.google.com/#view=home&op=translate&sl="+lang+"&tl="+lang+"&text="+word
async def get_url():
    r = await asession.get(url)
    print(r)
    return r
results = asession.run(get_url)
for result in results:
    print(result.html.url)
    print(result.html.find('#tlid-transliteration-content'))
    print(result.html.find('#tlid-transliteration-content transliteration-content full'))

So far, it returns nothing.

Solution

Yes, this happens because some of the content is generated by JavaScript after the page loads. The raw HTML that your request receives does not contain it; what you see in the inspector is the final DOM, after JavaScript has added that content. One way to solve this is Selenium, but it has downsides such as speed and memory use. A more modern and, in my opinion, better way is requests-html, which replaces both bs4 and urllib here and has a render method, as mentioned in its documentation.
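A related detail that may help explain the incomplete page: everything after # in a URL is the fragment, which stays on the client side and is not part of the HTTP request. The following is a minimal sketch using only the standard library to show which part of the question's URL a server would actually be asked for. The helper name request_target is made up for illustration.

```python
from urllib.parse import urlsplit

URL = "https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello"


def request_target(url: str) -> str:
    """Return the path plus query string, i.e. the part that goes on the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


parts = urlsplit(URL)
print(parts.netloc)         # translate.google.com
print(request_target(URL))  # /
print(parts.fragment)       # view=home&op=translate&sl=en&tl=en&text=hello
```

So the word and the language codes never reach the server in this URL. They are read by the page's own JavaScript in the browser, which is why a plain download only contains the empty start page.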

Here is sample code using requests_html. Keep in mind that the output contains non-ASCII characters, so you might run into problems printing it in some editors such as Sublime Text; it ran fine from cmd.

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello")
# Execute the page's JavaScript in headless Chromium (downloaded on first use).
r.html.render()
# Transliteration element inside the source-input container.
css = ".source-input .tlid-transliteration-content"
print(r.html.find(css, first=True).text)
# output: heˈlō,həˈlō
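If the selector matches nothing, for example because the markup has changed or rendering did not finish, find(..., first=True) returns None and the .text access raises AttributeError. Below is a hedged sketch of a more defensive variant. The names element_text and fetch_transliteration are illustrative and not part of requests-html, and the URL and selector are simply the ones used above, so whether they still match depends on the page's current markup.

```python
from typing import Optional

URL = "https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello"
CSS = ".source-input .tlid-transliteration-content"


def element_text(element) -> Optional[str]:
    """Return the element's stripped text, or None when nothing matched."""
    if element is None:
        return None
    return element.text.strip()


def fetch_transliteration(url: str = URL, css: str = CSS) -> Optional[str]:
    """Render the page and return the first match's text, or None (needs network access)."""
    # Imported here so element_text can be used without requests-html installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        r = session.get(url)
        r.html.render()  # executes the page's JavaScript in headless Chromium
        return element_text(r.html.find(css, first=True))
    finally:
        session.close()  # also shuts down the browser started by render()
```

Calling fetch_transliteration() then gives either the text or None, so the caller can report a missing element instead of crashing.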

Answered By - Marsilinou Zaky

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
