Issue
I am trying to scrape the email address from the following webpage using Python with requests and BeautifulSoup (bs4), but the address does not appear in the page source.
https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html
Clicking the email link opens my mail app, but I could not find the link in the page source. I understand this could be done by watching the network tab and replaying the POST request that the website makes, but I could not get that to work.
Thanks in advance!!
Solution
The email is Base64-encoded inside the JSON variable (window.__WEB_CONTEXT__) embedded in the page.
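To see what "Base64-encoded" means here in isolation, the round trip below encodes and decodes a made-up value. The wrapped shape ('abc_mailto:..._xyz') and the address are assumptions for illustration only; the real page's wrapping may differ.

```python
import base64

# Hypothetical wrapped value; the prefix, suffix and address are made up.
plain = 'abc_mailto:info@example.com_xyz'

# Encode to the kind of opaque string that would sit in the page's JSON.
encoded = base64.b64encode(plain.encode('utf-8')).decode('ascii')

# Decode it back to readable text.
decoded = base64.b64decode(encoded).decode('utf-8')

print(encoded)
print(decoded)
```

Because the encoded form contains no 'mailto:' or '@', searching the raw page source for the address finds nothing.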
You can use this example to get all emails found on the page:
import re
import json
import base64
import requests

url = 'https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html'
html_data = requests.get(url).text

# Pull the object assigned to window.__WEB_CONTEXT__ out of the page source.
data = re.search(r'window\.__WEB_CONTEXT__=(\{.*?\});', html_data).group(1)
# The top-level key is unquoted, so quote it to make the text valid JSON.
data = json.loads(data.replace('pageManifest', '"pageManifest"'))

def get_emails(val):
    # Recursively yield every non-empty value stored under an 'email' key.
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'email':
                if v:
                    yield v
            else:
                yield from get_emails(v)
    elif isinstance(val, list):
        for v in val:
            yield from get_emails(v)

for email in get_emails(data):
    # Each value is Base64; the decoded text contains 'mailto:<address>_'.
    email = base64.b64decode(email).decode('utf-8')
    email = re.search(r'mailto:(.*)_', email).group(1)
    print(email)
Prints:
[email protected]
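If the live page is unavailable or its markup has changed, the same steps can be tried offline. The following is only a sketch on synthetic data: the HTML snippet, the wrap() helper, the 'abc_mailto:..._xyz' shape and the example.com addresses are all assumptions made up for illustration, not values taken from the site. It also raises an error instead of failing on None when the variable is missing.

```python
import base64
import json
import re


def wrap(address):
    """Build a test value in the assumed 'prefix_mailto:address_suffix' shape."""
    raw = f'abc_mailto:{address}_xyz'
    return base64.b64encode(raw.encode('utf-8')).decode('ascii')


def extract_context(html):
    """Return the parsed object assigned to window.__WEB_CONTEXT__."""
    match = re.search(r'window\.__WEB_CONTEXT__=(\{.*?\});', html)
    if match is None:
        raise ValueError('window.__WEB_CONTEXT__ not found')
    # Quote only the first (top-level) occurrence of the unquoted key.
    return json.loads(match.group(1).replace('pageManifest', '"pageManifest"', 1))


def get_emails(val):
    """Recursively yield every non-empty value stored under an 'email' key."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'email':
                if v:
                    yield v
            else:
                yield from get_emails(v)
    elif isinstance(val, list):
        for v in val:
            yield from get_emails(v)


def decode_email(encoded):
    """Base64-decode a value; return the text between 'mailto:' and the last '_'."""
    text = base64.b64decode(encoded).decode('utf-8')
    match = re.search(r'mailto:(.*)_', text)
    return match.group(1) if match else None


# Synthetic page with an unquoted top-level key, as the replace() call implies.
payload = {'items': [
    {'contact': {'email': wrap('info@example.com')}},
    {'contact': {'email': None}},  # empty values are skipped
    {'nested': [{'email': wrap('first_last@example.org')}]},
]}
html_data = ('<script>window.__WEB_CONTEXT__={pageManifest:'
             + json.dumps(payload) + '};</script>')

emails = [decode_email(e) for e in get_emails(extract_context(html_data))]
print(emails)
```

Because (.*) is greedy, the pattern stops at the last underscore, so an address that itself contains an underscore still comes out whole in this sketch.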
Answered By - Andrej Kesely