Issue
I want to automate some tasks from my daily job by web-scraping an intranet site that manages a huge amount of data. The site is rendered in JavaScript, so I tried to scrape it with Python and Selenium, but it doesn't work: the page is displayed correctly in the new tab that opens, but when I print the page source it looks as if JavaScript were not enabled.
Below you can find my code; http://intranet.page:port/path is just a placeholder.
import os
import undetected_chromedriver as uc
from selenium.webdriver.support.wait import WebDriverWait

def document_initialised(driver):
    # Wait until the browser reports the document has finished loading
    return driver.execute_script("return document.readyState") == "complete"

os.environ['PATH'] += os.pathsep + r"D:/SeleniumDrivers"
driver = uc.Chrome()
driver.get("http://intranet.page:port/path")
WebDriverWait(driver, timeout=10).until(document_initialised)
print(driver.page_source)
I also tried with undetected_chromedriver, but nothing changed, and a different browser (Edge) gave the same result.
I also tried Scrapy, but it returns a lot of errors; here are the top ones:
>>> fetch('http://intranet.page:port/path')
2022-12-28 23:08:42 [scrapy.core.engine] INFO: Spider opened
2022-12-28 23:08:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://intranet.page:port/path/robots.txt> (referer: None)
2022-12-28 23:08:42 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2022-12-28 23:08:42 [filelock] DEBUG: Attempting to acquire lock 2334601525136 on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Lock 2334601525136 acquired on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Attempting to release lock 2334601525136 on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [filelock] DEBUG: Lock 2334601525136 released on c:\legacyapp\python\python39\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-12-28 23:08:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://intranet.page:port/path> (referer: None)
2022-12-28 23:08:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://intranet.page:port/path> (referer: None)
Traceback (most recent call last):
Solution
It was due to iframes: their content is not part of the top-level page source, so each frame has to be entered with switch_to.frame before its HTML can be read.
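A minimal sketch of that fix: switch into each top-level iframe, collect its source, and switch back. The helper name collect_frame_sources is ours, not from the original post, and the driver is assumed to already have the page loaded.

```python
def collect_frame_sources(driver):
    """Return the page source of the top document and of every top-level iframe."""
    sources = [driver.page_source]  # top-level document first
    # "tag name" is the string value of selenium's By.TAG_NAME locator
    frames = driver.find_elements("tag name", "iframe")
    for index in range(len(frames)):
        # Switch by index; element references can go stale after switching
        driver.switch_to.frame(index)
        sources.append(driver.page_source)
        driver.switch_to.default_content()  # back to the top document
    return sources
```

Note that nested iframes would need a recursive switch_to.frame for each level; the sketch above only covers frames directly under the top document.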
Answered By - andrei141592