Issue
The website I am scraping has a mechanism that detects when my requests are too frequent: it locks the account and redirects every request to a user validation page, where the user must slide a bar to unlock it.
The sliding bar can be solved easily with a Selenium ActionChains sequence; however, I don't know where to add this functionality in Scrapy.
Basically, in my Scrapy spider, for each request I want to:
1. Check whether the response is the user validation page.
2. If it is the user validation page:
a. Start a Selenium webdriver and send the request again. Then, in the webdriver, solve the sliding bar to unlock my account.
b. Have the spider send a request for the same URL again, and keep scraping data from the response.
3. If it is not the user validation page, the spider scrapes the data from the response as usual.
You see, in step 2 the Scrapy spider will need to request the same URL twice and the Selenium webdriver will need to request it once. I am not sure how to implement this in the Scrapy framework. Any ideas?
The following is my spider structure; I am not sure where to add the aforementioned functionality. Or should I use a middleware?
class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):  # note: the method Scrapy calls is start_requests, not start_request
        # read urls from an external file
        urls = [...]
        for url in urls:
            # the response could be a user validation page
            yield scrapy.Request(url)

    def parse(self, response):
        # parse a valid page and scrape data
        yield item
--- Update 2018-03-19 ---
I think I found a better way to implement this functionality. I ended up creating a middleware class so that it is reusable and the codebase stays clean.
Solution
class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):  # note: the method Scrapy calls is start_requests, not start_request
        # read urls from an external file
        urls = [...]
        for url in urls:
            # the response could be a user validation page
            yield scrapy.Request(url)

    def parse(self, response):
        # check if it's the user validation page;
        # here I assume you know how to judge whether it is one
        if is_validation_page(response):
            # Selenium goes here
            browser = webdriver.PhantomJS()
            ...
            # send the request again; dont_filter=True bypasses Scrapy's
            # duplicate filter, which would otherwise drop the repeated URL
            yield scrapy.Request(browser.current_url, dont_filter=True)
        else:
            # not the validation page: parse the data
            yield item
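Since the update mentions making this reusable as a middleware, here is a minimal downloader-middleware sketch of the same idea. The helper `is_validation_page` and the slider selector `#slider` are assumptions you must adapt to the real site, and Chrome is used instead of the now-deprecated PhantomJS; this is a sketch, not a drop-in implementation.

```python
def is_validation_page(response):
    """Placeholder heuristic -- replace with the site-specific check,
    e.g. a redirect to a known validation URL."""
    return "validate" in response.url


class ValidationMiddleware:
    """Downloader middleware: if a response is the validation page,
    unlock the account with Selenium and re-issue the original request."""

    def process_response(self, request, response, spider):
        if not is_validation_page(response):
            # normal page: hand the response to the spider unchanged
            return response

        # Imported lazily so the module loads without Selenium installed.
        from selenium import webdriver
        from selenium.webdriver.common.action_chains import ActionChains

        browser = webdriver.Chrome()  # PhantomJS is deprecated; Chrome assumed here
        try:
            browser.get(request.url)
            # "#slider" is an assumed selector; the drag offset is a guess too
            slider = browser.find_element("css selector", "#slider")
            ActionChains(browser).click_and_hold(slider) \
                .move_by_offset(300, 0).release().perform()
        finally:
            browser.quit()

        # Retry the original request; dont_filter=True bypasses
        # Scrapy's duplicate filter, which would otherwise drop it.
        return request.replace(dont_filter=True)
```

Enable it via `DOWNLOADER_MIDDLEWARES` in `settings.py`; because the retry happens in `process_response`, the spider's `parse` never sees the validation page and needs no special-case branch.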
Answered By - just_be_happy