Issue
I want to automate some tasks from my daily job by web-scraping an intranet site that manages a huge amount of data. The site is rendered in JavaScript, so I tried to scrape it with Python and Selenium, but it doesn't work: the page is displayed correctly in the new tab that opens, but when I print the page source it looks as if JavaScript were not enabled.
Below you can find my code; http://intranet.page:port/path is just a placeholder.
import os
import undetected_chromedriver as uc
from selenium.webdriver.support.wait import WebDriverWait

def document_initialised(driver):
    # Wait until the browser reports the document has finished loading
    return driver.execute_script("return document.readyState") == "complete"

os.environ['PATH'] += os.pathsep + r"D:/SeleniumDrivers"
driver = uc.Chrome()
driver.get("http://intranet.page:port/path")
WebDriverWait(driver, timeout=10).until(document_initialised)
print(driver.page_source)
I also tried with undetected_chromedriver, but nothing changed, and a different browser (Edge) gave the same result.
I also tried Scrapy, but it returns a lot of errors; here are the top ones:
>>> fetch('http://intranet.page:port/path')
2022-12-28 23:08:42 [scrapy.core.engine] INFO: Spider opened
2022-12-28 23:08:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://intranet.page:port/path/robots.txt> (referer: None)
2022-12-28 23:08:42 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2022-12-28 23:08:42 [filelock] DEBUG: Attempting to acquire lock 2334601525136 on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Lock 2334601525136 acquired on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Attempting to release lock 2334601525136 on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Lock 2334601525136 released on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://intranet.page:port/path> (referer: None)
2022-12-28 23:08:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://intranet.page:port/path> (referer: None)
Traceback (most recent call last):
Solution
It was due to iframes: their content is not part of the top-level page source, so each frame has to be entered with switch_to.frame before its HTML can be read.
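A minimal sketch of that fix: switch into each top-level iframe, collect its source, and switch back. The helper name collect_frame_sources is ours, not from the original post, and the driver is assumed to already have the page loaded.

```python
def collect_frame_sources(driver):
    """Return the page source of the top document and of every top-level iframe."""
    sources = [driver.page_source]  # top-level document first
    # "tag name" is the string value of selenium's By.TAG_NAME locator
    frames = driver.find_elements("tag name", "iframe")
    for index in range(len(frames)):
        # Switch by index; element references can go stale after switching
        driver.switch_to.frame(index)
        sources.append(driver.page_source)
        driver.switch_to.default_content()  # back to the top document
    return sources
```

Note that nested iframes would need a recursive switch_to.frame for each level; the sketch above only covers frames directly under the top document.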
Answered By - andrei141592