Issue
I am trying to retrieve the HTML of a site using a headless Chrome driver. However, I get an "Access Denied" response. If I use a "regular" (non-headless) driver, it all works fine.
Is there any way to bypass that?
It's my first post, so I apologize for any formatting mistakes.
from selenium import webdriver
# Headless driver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
driver1 = webdriver.Chrome(executable_path='./chromedriver', options=chrome_options,
                           service_args=['--verbose', '--log-path=/tmp/chromedriver.log'])
driver1.get('https://www.size.co.uk/')
html = driver1.page_source
driver1.quit()  # close the headless browser once the page source has been read
html
The message I get is:
<html xmlns="http://www.w3.org/1999/xhtml"><head>\n<title>Access Denied</title>\n</head><body>\n<h1>Access Denied</h1>\n \nYou don\'t have permission to access "http://www.size.co.uk/" on this server.<p>\nReference #18.ac81655f.1548818550.73b12da\n\n\n</p></body></html>
Regular driver:
driver = webdriver.Chrome('./chromedriver')
driver.get('https://www.size.co.uk/')
html = driver.page_source
driver.quit()
html
Ideally, I'd like the output to be the same as in the latter case, without new windows popping up every couple of seconds.
Solution
Adding the following snippet to the headless setup got the page to load for me:
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
chrome_options.add_argument('user-agent={0}'.format(user_agent))
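Combined with the options from the question, the fix might look like the sketch below. The helper names headless_arguments and fetch_page_source are made up for illustration, and the sketch assumes Selenium can find chromedriver on its own (for example on PATH) rather than through the explicit './chromedriver' path used in the question.

```python
from typing import List

# User agent string from the answer; swap in one that matches your Chrome version.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36"
)


def headless_arguments(user_agent: str) -> List[str]:
    """Chrome arguments for a headless session that reports a custom user agent."""
    return ["--headless", "--no-sandbox", "user-agent={0}".format(user_agent)]


def fetch_page_source(url: str, user_agent: str = USER_AGENT) -> str:
    """Load url in headless Chrome and return the page source."""
    # Imported here so headless_arguments can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_arguments(user_agent):
        options.add_argument(argument)

    # Assumption: chromedriver is discoverable without an explicit path.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if the page load raises.
        driver.quit()
```

Calling fetch_page_source('https://www.size.co.uk/') should then return the real page rather than the "Access Denied" document, provided the site is only looking at the user agent.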
The site appears to be checking the user agent for headless browsers and denying them access. This article covers avoiding detection in more detail: Making Chrome Headless Undetectable
To see the user agent the driver is actually sending, you can run the following command:
driver.execute_script("return navigator.userAgent")
Chrome's headless user agent looks something like this:
u'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36'
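Since the visible giveaway in that string is the HeadlessChrome token, one option is to derive the replacement user agent from the driver's own value instead of hard-coding an old Chrome version. A minimal sketch, where strip_headless_marker is a hypothetical helper name and the sample string is the one shown above:

```python
def strip_headless_marker(user_agent: str) -> str:
    """Rename the 'HeadlessChrome' product token to plain 'Chrome'."""
    return user_agent.replace("HeadlessChrome/", "Chrome/")


# Sample value as reported by navigator.userAgent in a headless session.
headless_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/71.0.3578.98 Safari/537.36"
)

regular_ua = strip_headless_marker(headless_ua)
print(regular_ua)
```

The result can be passed to chrome_options.add_argument('user-agent={0}'.format(regular_ua)) when creating a new driver. This only helps if the site is checking the user agent string and nothing else.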
Answered By - cullzie