Issue
I keep getting an error when trying to scrape several URLs with Scrapy while using a Selenium downloader middleware.
Middleware.py:
import undetected_chromedriver as uc
from scrapy.http import HtmlResponse

class SeleniumMiddleWare(object):
    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"
        options = uc.ChromeOptions()
        options.headless = True
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except Exception:
            pass
        content = self.driver.page_source
        self.driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response
Spider.py:
import scrapy

class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    # allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        table = response.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr[class="app"]')
        for element in rows:
            link = "https://steamdb.info" + element.css('::attr(href)').get()
            name = element.css('a ::text')[0].get()
            game_info = {"Link": link, "Name": name}
            yield scrapy.Request(url=link, callback=self.parse_info, cb_kwargs=dict(game_info=game_info))

    def parse_info(self, response, game_info):
        game_info["sales"] = response.xpath('//*[@id="graphs"]/div[5]/div[2]/ul/li[1]/strong/span/text()').getall()
        yield game_info
Note: The scraper works without cb_kwargs when making a new request and following links. If I only scrape the pages in start_urls, it works, but it fails as soon as I make new requests to other URLs or follow pages.
The error:
2022-07-12 20:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://steamdb.info/graph/> (referer: https://steamdb.info/graph/)
2022-07-12 20:53:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:52304/session/99578d3d4f168c77b58a85f67be06927/execute/sync {"script": "return navigator.webdriver", "args": []}
2022-07-12 20:53:54 [urllib3.connectionpool] DEBUG: Resetting dropped connection: localhost
2022-07-12 20:53:56 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:56 [urllib3.connectionpool] WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5EB66EC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
2022-07-12 20:53:56 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (2): localhost:52304
2022-07-12 20:53:58 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:58 [urllib3.connectionpool] WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5ED6C970>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
Solution
The message "the target machine actively refused it" means the machine responded but nothing is listening on the specified port (52304), so the connection was refused. Could you check whether you can access that port? Maybe a local firewall is blocking it?
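As a quick way to run that check, a small socket probe (a hedged sketch; the host and port are just the values from the log above) tells you whether anything is accepting TCP connections on the port:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing reachable is listening.
        return False

# Port taken from the chromedriver session URL in the log:
print(port_open("localhost", 52304))
```

If this prints False while the spider is still running, the chromedriver process that owned that port is no longer alive, which matches the retry loop in the log.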
UPD: it looks like you're calling self.driver.quit() in each process_request, so the chromedriver process is gone by the time the next request arrives. Either re-initialize the driver for each request, or don't call .quit() until the crawl is finished.
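The lifecycle fix can be sketched as follows (DummyDriver is a hypothetical stand-in for uc.Chrome, so the pattern runs without launching a browser): create the driver once, reuse it for every request, and quit it only when the spider closes.

```python
# Sketch of the lifecycle fix: one driver for the whole crawl, quit once at
# the end. DummyDriver stands in for uc.Chrome; in the real middleware,
# __init__ would build uc.Chrome(...) exactly as in the question.

class DummyDriver:
    def __init__(self):
        self.alive = True
        self.visited = []

    def get(self, url):
        assert self.alive, "driver was quit before this request"
        self.visited.append(url)

    @property
    def page_source(self):
        return "<html></html>"

    def quit(self):
        self.alive = False


class SeleniumMiddleWare:
    def __init__(self, driver):
        self.driver = driver  # created once, shared by every request

    def process_request(self, url):
        # No self.driver.quit() here -- later requests still need the driver.
        self.driver.get(url)
        return self.driver.page_source

    def spider_closed(self):
        # In the real middleware, connect this to Scrapy's spider_closed
        # signal so it runs exactly once, after the last request.
        self.driver.quit()


mw = SeleniumMiddleWare(DummyDriver())
mw.process_request("https://steamdb.info/graph/")
mw.process_request("https://steamdb.info/app/570/")  # second request still works
mw.spider_closed()
```

In an actual Scrapy middleware you would register the handler with a from_crawler classmethod that calls crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed), so .quit() fires once when the crawl ends.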
Answered By - svfat