Issue
As the title says, I have been stuck on this for hours and cannot find any documentation or solution.
This is the website where I started: https://idhsaa.org/directory. I cannot access the email addresses on this page, or on the individual school pages that open when you click a school name.
The markup I found looks something like this:
<p>
<script>
document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOnNwZWNrZXI3M0B5YWhvby5jb20nPkVtYWlsPC9hPg=='));
</script>
<a href="mailto:[email protected]">Email</a>
</br>
</p>
I managed to extract the encoded value, which looks something like this:
mailto:<script>document.write(window.atob('PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'));</script>
Based on this output, I assume I need to decode the string to get the actual email address. How do I decode it?
Here's the code that I've been working on:
import requests
from bs4 import BeautifulSoup


def url_parser(url):
    # Fetch the page with a browser-like User-Agent and parse the HTML.
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
    }
    html_doc = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html_doc, 'html.parser')
    return soup


def data_fetch(url):
    soup = url_parser(url)
    table = soup.find('table').find('tbody')
    rows = table.find_all('tr')
    for row in rows:
        school_name = row.find_all('a')
        for school in school_name:
            # Anchors without an href return None, so fall back to ''.
            # Assumption: the href is a path relative to the site root.
            school_web_id = (school.get('href') or '').lstrip('/')
            if 'school?' in school_web_id:
                school_website = url.replace('/directory', f'/{school_web_id}')
                school_site = url_parser(school_website)
                principal_email_encoded = school_site.find_all('a')
                for principal_email in principal_email_encoded:
                    email = principal_email.get('href') or ''
                    # Prints the still-encoded document.write(...) call.
                    if 'mailto:<script>' in email:
                        print(email.replace('mailto:<script>', '').replace(';</script>', ''))


def main():
    url = "https://idhsaa.org/directory"
    data_fetch(url)


if __name__ == "__main__":
    main()
Solution
These are base64-encoded strings. You can decode them with the base64 module from the Python standard library.
For example, after extracting the encoded string you can do the following:
import base64
encoded_str = "PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+"
decoded_html = base64.b64decode(encoded_str).decode("utf-8")
print(decoded_html)
Output:
<a href='mailto:admin@idhsaa.org'>admin@idhsaa.org</a>
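The decoded value is still an HTML anchor, so one more step is needed to get the bare address. The following is a minimal sketch of that step, assuming the scraped href always has the `document.write(window.atob('...'))` shape shown in the question. The names `ATOB_RE`, `MAILTO_RE`, and `extract_emails` are hypothetical helpers introduced here for illustration, not part of the site or of any library.

```python
import base64
import re

# Assumption: the payload is always passed as a quoted literal to atob(...).
ATOB_RE = re.compile(r"atob\(\s*['\"]([A-Za-z0-9+/=]+)['\"]\s*\)")
# Captures the address after "mailto:", stopping at a quote, '?', '>' or whitespace.
MAILTO_RE = re.compile(r"mailto:([^'\"?>\s]+)")


def extract_emails(text):
    """Return the email addresses hidden inside atob(...) calls found in text."""
    emails = []
    for payload in ATOB_RE.findall(text):
        # Decode the base64 payload back into the HTML the browser would write.
        decoded = base64.b64decode(payload).decode("utf-8")
        # Pull every mailto: address out of the decoded HTML.
        emails.extend(MAILTO_RE.findall(decoded))
    return emails


# The href value from the question, as returned by BeautifulSoup.
href = (
    "mailto:<script>document.write(window.atob("
    "'PGEgaHJlZj0nbWFpbHRvOmFkbWluQGlkaHNhYS5vcmcnPmFkbWluQGlkaHNhYS5vcmc8L2E+'"
    "));</script>"
)
print(extract_emails(href))  # ['admin@idhsaa.org']
```

In the scraper from the question, `extract_emails(email)` could take the place of the `print(email.replace(...))` line. A regular expression is used here only to keep the sketch free of extra dependencies; parsing the decoded HTML with BeautifulSoup and reading the anchor's `href` would work as well.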
Answered By - Wasi