Thursday, October 20, 2022

[FIXED] How to decode the string encoded by document.write in BeautifulSoup, Python?

Labels: beautifulsoup, python, python-requests, web-scraping

Issue

As the title says, I've been stuck on this for hours and can't find any documentation or solution.
This is the website where I started: https://idhsaa.org/directory. I cannot access the email addresses, either on this page or on the individual school pages that open when you click a school name.

The format that I found is something like this:

<p>
    <script>
        document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOnNwZWNrZXI3M0B5YWhvby5jb20nPkVtYWlsPC9hPg=='));
    </script>
    <a href="mailto:[email protected]">Email</a>
    </br>
</p>

I managed to extract the encoded value, which looks something like this:

mailto:<script>document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'));</script>

The question is: how do I decode this to get the email addresses?
Judging by the output above, I assume I need to decode the string passed to window.atob to get the actual email.

Here's the code that I'd been working on:

import requests
from bs4 import BeautifulSoup


def url_parser(url):
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    }
    html_doc = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup


def data_fetch(url):
    soup = url_parser(url)
    table = soup.find('table').find('tbody')
    rows = table.find_all('tr')

    data = []
    for row in rows:
        school_name = row.find_all('a')

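        for school in school_name:
            # Some anchors may have no href, so fall back to an empty string
            href = school.get('href', '')
            if 'school?' in href:
                # Assumption: the school page path is the link's own href
                school_web_id = href.lstrip('/')
                school_website = url.replace('/directory', f'/{school_web_id}')

                school_site = url_parser(school_website)
                principal_email_encoded = school_site.find_all('a')
                for principal_email in principal_email_encoded:
                    email = principal_email.get('href', '')
                    # Keep only the links whose href wraps the obfuscating script
                    if 'mailto:<script>' in email:
                        print(email.replace('mailto:<script>', '').replace(';</script>', ''))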



def main():
    url = "https://idhsaa.org/directory"
    data_fetch(url)


if __name__ == "__main__":
    main()

Solution

These are Base64-encoded strings (window.atob is the browser's Base64 decoder). You can decode the value with the base64 module from the Python standard library.

For example, after extracting the encoded string, you can do the following:

import base64
encoded_str = "PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+"
decoded_html = base64.b64decode(encoded_str).decode("utf-8")
print(decoded_html)

Output:

<a href='mailto:admin@idhsaa.org'>admin@idhsaa.org</a>

Answered By - Wasi
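
Building on the answer above, the extraction and decoding steps can be combined. The following is a minimal sketch, not part of the original answer: the helper name extract_emails and both regular expressions are assumptions made for illustration, and it presumes the page always uses the atob('...') form with a mailto: link inside the decoded HTML. It uses only the standard library, so it can be run on any text, such as an href value or str(soup).

```python
import base64
import re

# Assumed pattern: captures the Base64 argument of atob('...') or atob("...").
ATOB_PATTERN = re.compile(r"atob\(\s*['\"]([A-Za-z0-9+/=]+)['\"]\s*\)")
# Assumed pattern: captures the address that follows "mailto:" in the decoded HTML.
MAILTO_PATTERN = re.compile(r"mailto:([^'\"?\s>]+)")


def extract_emails(html_text):
    """Return the email addresses hidden in atob(...) calls within html_text."""
    emails = []
    for encoded in ATOB_PATTERN.findall(html_text):
        # Decode the Base64 payload back into the HTML snippet it represents
        decoded = base64.b64decode(encoded).decode("utf-8")
        # Pull the address out of the decoded mailto: link
        emails.extend(MAILTO_PATTERN.findall(decoded))
    return emails


# The encoded value from the question
sample = (
    "mailto:<script>document.write(window.atob("
    "'PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'"
    "));</script>"
)
print(extract_emails(sample))  # ['admin@idhsaa.org']
```

In the question's loop, this hypothetical helper could be called on each href value in place of the chained replace() calls.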

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
