Issue
I'm trying to scrape app names (listed at the bottom of the page) from [this website] using requests_html and CSS selectors, but it returns an empty list. Can you please provide an explanation? The code:
import requests_html
from requests_html import HTMLSession
s = HTMLSession()
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
}
url = 'https://www.workato.com/integrations/salesforce'
r = s.get(url, headers=headers)
r.html.render(sleep=4)
apps = r.html.find('#__layout > div > div > div > div > div > main > article.apps-page__section.apps-page__section_search > div > div > div.apps-page__integrations > div > ul')
print(apps)
I tried the following:
for app in apps:
    print(app)
and I also used .text
but the output always says:
[]
Solution
The data you're looking for is embedded in an external JavaScript file, so parsing the page's HTML (for example with BeautifulSoup) doesn't help here.
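The file name used in the example below looks like it contains a build hash, so it may change over time. A minimal sketch of one way to list a page's external scripts so you can look for the right file, using only the standard library. The `sample_html` string and its URLs are hypothetical stand-ins; in practice you would feed in the text of the page you downloaded.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append(src)


# Hypothetical page source; replace with requests.get(url).text
sample_html = """
<html><head>
<script src="https://cdn.example.com/assets/app.js"></script>
<script>console.log("inline")</script>
<script src="/assets/vendor.js" defer></script>
</head><body><div id="__layout"></div></body></html>
"""

collector = ScriptSrcCollector()
collector.feed(sample_html)
script_sources = collector.sources
print(script_sources)
```

Each candidate URL can then be downloaded and searched for a `JSON.parse(` call to see whether it holds the app data.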
To load all applications at once into a pandas DataFrame, you can use the following example:
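import re
import requests
import pandas as pd
from ast import literal_eval

# JavaScript file that holds the app data
url = 'https://cdn.marie.awsprod.workato.com/mktg-assets/c8ce8de9.js'
html_doc = requests.get(url).text

# Grab the string literal passed to JSON.parse('...')
data = re.search(r'JSON\.parse\(\'(.*?)\'\)', html_doc).group(1)
# Evaluate it as a Python literal to get a dict keyed by app identifier
data = literal_eval(data)

# One row per app
df = pd.DataFrame.from_dict(data, orient='index')
print(df.head())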
Prints:
| | name | title | build_type | categories | aliases | url_name |
|---|---|---|---|---|---|---|
| kissmetrics | kissmetrics | Kissmetrics | unsupported | ['Upcoming'] | nan | nan |
| gusto | gusto | Gusto | custom | ['HR management', 'Staff Management', 'Time and Expense'] | nan | nan |
| adobeexpmgr | adobeexpmgr | Adobe Experience Manager | unsupported | ['Sales'] | nan | nan |
| synthesio | synthesio | Synthesio | unsupported | ['Sales'] | nan | nan |
| teamwork | teamwork | Teamwork | unsupported | ['Sales'] | nan | nan |
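To see the extraction step in isolation, here is a minimal offline sketch of the same regex plus `literal_eval` approach. The `sample_js` string and the `extract_apps` helper are hypothetical and only mimic the shape of the real file, which is much larger.

```python
import re
from ast import literal_eval

# Hypothetical stand-in for the downloaded JavaScript source
sample_js = """window.apps = JSON.parse('{"gusto": {"name": "gusto", "title": "Gusto", "build_type": "custom"}, "teamwork": {"name": "teamwork", "title": "Teamwork", "build_type": "unsupported"}}');"""


def extract_apps(js_text):
    """Return the dict passed as a string literal to the first JSON.parse('...') call."""
    match = re.search(r"JSON\.parse\('(.*?)'\)", js_text)
    if match is None:
        raise ValueError("no JSON.parse('...') call found")
    return literal_eval(match.group(1))


apps = extract_apps(sample_js)
titles = [info["title"] for info in apps.values()]
print(titles)
```

Note that `literal_eval` only accepts Python literals, so it would raise an error if the payload contained JSON-only tokens such as `true`, `false` or `null`; in that case `json.loads` may be a better fit, assuming the string needs no further unescaping.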
Answered By - Andrej Kesely