Issue
I have a very simple problem. I'm trying to get the job description from the HTML of a LinkedIn page, but instead of the page's HTML I'm getting a few lines that look like JavaScript code. I'm very new to this, so any help will be greatly appreciated! Thanks
Here's my code:
import requests
url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
page_html = requests.get(url).text
print(page_html)
When I run this, I don't get the HTML I expect containing the job description. I just get a few lines of JavaScript code instead.
Solution
Some websites present different content depending on the client that requests them, and LinkedIn is a good example of this behavior. If the client is a full browser with advanced capabilities, the website may present “richer” content that is dynamic, styled, and built by JavaScript. A plain script using requests does not run JavaScript, so it only receives the initial code and never sees the rendered page.
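A quick way to confirm that this is what happened is to measure how much visible text the response actually contains. The following is a minimal sketch using only the standard library; the helper name looks_like_js_stub and the 200-character threshold are assumptions chosen for illustration, not part of requests or of anything LinkedIn documents.

```python
from html.parser import HTMLParser


class _TextSplitter(HTMLParser):
    """Counts characters inside <script>/<style> versus visible text."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0   # > 0 while inside <script> or <style>
        self.script_chars = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self._hidden_depth:
            self.script_chars += n
        else:
            self.visible_chars += n


def looks_like_js_stub(html, min_visible_chars=200):
    """Return True when the page has almost no visible text.

    min_visible_chars is an arbitrary threshold (assumption); tune it
    for the pages you are fetching.
    """
    parser = _TextSplitter()
    parser.feed(html)
    parser.close()
    return parser.visible_chars < min_visible_chars
```

Calling looks_like_js_stub(page_html) on the text returned by requests.get(url).text should return True if the response is only a script stub, which suggests that a real browser is needed to render the page.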
To solve this problem, you need to follow these steps:
- Download ChromeDriver from the ChromeDriver downloads page. Choose the one that matches your OS.
- Extract the driver and put it in a directory of your choice, for example /usr.
- Install Selenium, which is a Python module, by running pip install selenium. Note that selenium may need another package called msgpack; if so, install it first using the command pip install msgpack.
- Now we are ready to run the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def create_browser(webdriver_path):
    # create a Selenium object that mimics the browser
    browser_options = Options()
    # the headless flag creates an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument("--no-sandbox")
    # Selenium 4 syntax; on Selenium 3 use
    # webdriver.Chrome(webdriver_path, chrome_options=browser_options)
    browser = webdriver.Chrome(service=Service(webdriver_path), options=browser_options)
    print("Done Creating Browser")
    return browser

url = "https://www.linkedin.com/jobs/view/inside-sales-manager-at-stericycle-1089095836/"
browser = create_browser('/usr/chromedriver')  # DON'T FORGET TO CHANGE THIS TO YOUR DIRECTORY
browser.get(url)
page_html = browser.page_source
print(page_html[-10:])  # prints dy></html>
browser.quit()  # close the browser when done
Now, you have the whole page. I hope this answers your question!!
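Once the rendered HTML is in page_html, the job description still has to be pulled out of it. The following is a rough sketch of one way to do that with the standard library only. The function name extract_text_by_class is made up for this example, and the class name "description" used below is hypothetical: inspect the page in your browser's developer tools to find the class that LinkedIn actually uses.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text inside the first element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # nesting depth inside the target element
        self.done = False     # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text_by_class(html, class_name):
    """Return the text of the first element whose class list contains class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# "description" is a placeholder class name, not LinkedIn's real markup.
sample = ('<div class="top">Nav</div>'
          '<div class="description box"><p>Manage the <b>inside sales</b> team.'
          '<br>Remote.</p></div><div>Footer</div>')
print(extract_text_by_class(sample, "description"))
```

With the real page you would call extract_text_by_class(page_html, "...") using the class name you found. If BeautifulSoup is installed, it can do the same job with less code.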
Answered By - Anwarvic