Issue
I keep getting an error when trying to scrape several URLs with Scrapy while using a Selenium downloader middleware.
Middleware.py:
import undetected_chromedriver as uc
from scrapy.http import HtmlResponse

class SeleniumMiddleWare(object):
    def __init__(self):
        path = "G:/Downloads/chromedriver.exe"
        options = uc.ChromeOptions()
        options.headless = True
        chrome_prefs = {}
        options.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = uc.Chrome(options=options, use_subprocess=True, driver_executable_path=path)

    def process_request(self, request, spider):
        try:
            self.driver.get(request.url)
        except Exception:
            pass
        content = self.driver.page_source
        self.driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        return response
Spider.py:
import scrapy

class SeleniumSpider(scrapy.Spider):
    name = 'steamdb'
    # allowed_domains = ['steamdb.info']
    start_urls = ['https://steamdb.info/graph/']

    def parse(self, response):
        table = response.xpath('//*[@id="table-apps"]/tbody')
        rows = table.css('tr[class="app"]')
        for element in rows:
            link = "https://steamdb.info" + element.css('::attr(href)').get()
            name = element.css('a ::text')[0].get()
            game_info = {"Link": link, "Name": name}
            yield scrapy.Request(url=link, callback=self.parse_info, cb_kwargs=dict(game_info=game_info))

    def parse_info(self, response, game_info):
        game_info["sales"] = response.xpath('//*[@id="graphs"]/div[5]/div[2]/ul/li[1]/strong/span/text()').getall()
        yield game_info
Note: The scraper works without cb_kwargs when making a new request and following links. If I only scrape the pages in start_urls, it works, but it fails as soon as I make new requests to other URLs or follow pages.
The error:
2022-07-12 20:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://steamdb.info/graph/> (referer: https://steamdb.info/graph/)
2022-07-12 20:53:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://localhost:52304/session/99578d3d4f168c77b58a85f67be06927/execute/sync {"script": "return navigator.webdriver", "args": []}
2022-07-12 20:53:54 [urllib3.connectionpool] DEBUG: Resetting dropped connection: localhost
2022-07-12 20:53:56 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:56 [urllib3.connectionpool] WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5EB66EC0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
2022-07-12 20:53:56 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (2): localhost:52304
2022-07-12 20:53:58 [urllib3.util.retry] DEBUG: Incremented Retry for (url='/session/99578d3d4f168c77b58a85f67be06927/execute/sync'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
2022-07-12 20:53:58 [urllib3.connectionpool] WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000015E5ED6C970>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/99578d3d4f168c77b58a85f67be06927/execute/sync
Solution
The message "the target machine actively refused it" means the machine responded but nothing is listening on the specified port (52304), so the connection was refused. Could you check whether you can access that port? Maybe a local firewall is blocking it?
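As a quick way to run that check, a small socket probe (a hedged sketch; the host and port are just the values from the log above) tells you whether anything is accepting TCP connections on the port:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing reachable is listening.
        return False

# Port taken from the chromedriver session URL in the log:
print(port_open("localhost", 52304))
```

If this prints False while the spider is still running, the chromedriver process that owned that port is no longer alive, which matches the retry loop in the log.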
UPD: it looks like you're calling self.driver.quit() in each process_request, so the chromedriver process is gone by the time the next request arrives. Either re-initialize the driver for each request, or don't call .quit() until the crawl is finished.
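The lifecycle fix can be sketched as follows (DummyDriver is a hypothetical stand-in for uc.Chrome, so the pattern runs without launching a browser): create the driver once, reuse it for every request, and quit it only when the spider closes.

```python
# Sketch of the lifecycle fix: one driver for the whole crawl, quit once at
# the end. DummyDriver stands in for uc.Chrome; in the real middleware,
# __init__ would build uc.Chrome(...) exactly as in the question.

class DummyDriver:
    def __init__(self):
        self.alive = True
        self.visited = []

    def get(self, url):
        assert self.alive, "driver was quit before this request"
        self.visited.append(url)

    @property
    def page_source(self):
        return "<html></html>"

    def quit(self):
        self.alive = False


class SeleniumMiddleWare:
    def __init__(self, driver):
        self.driver = driver  # created once, shared by every request

    def process_request(self, url):
        # No self.driver.quit() here -- later requests still need the driver.
        self.driver.get(url)
        return self.driver.page_source

    def spider_closed(self):
        # In the real middleware, connect this to Scrapy's spider_closed
        # signal so it runs exactly once, after the last request.
        self.driver.quit()


mw = SeleniumMiddleWare(DummyDriver())
mw.process_request("https://steamdb.info/graph/")
mw.process_request("https://steamdb.info/app/570/")  # second request still works
mw.spider_closed()
```

In an actual Scrapy middleware you would register the handler with a from_crawler classmethod that calls crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed), so .quit() fires once when the crawl ends.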
Answered By - svfat