Issue
The website I am scraping has a mechanism that detects when my requests are too frequent: it locks the account and redirects every request to a user validation page, where the user must slide a bar to unlock it.
The sliding bar can be solved easily with a Selenium ActionChains sequence; however, I don't know where to add this functionality in Scrapy.
Basically, in my Scrapy spider, for each request I want to:
1. Check whether the response is the user validation page.
2. If it is the user validation page:
a. Start a Selenium webdriver and send the request again. Then, in the webdriver, solve the sliding bar to unlock my account.
b. Have the spider send a request for the same URL again, and keep scraping data from the response.
3. If it is not the user validation page, the spider scrapes the data from the response as usual.
You see, in step 2 the Scrapy spider will need to request the same URL twice and the Selenium webdriver will need to request it once. I am not sure how to implement this in the Scrapy framework. Any ideas?
The following is my spider structure; I am not sure where to add the aforementioned functionality. Or should I use a middleware?
class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):  # note: the method Scrapy calls is start_requests, not start_request
        # read urls from an external file
        urls = [...]
        for url in urls:
            # the response could be a user validation page
            yield scrapy.Request(url)

    def parse(self, response):
        # parse a valid page and scrape data
        yield item
--- Update 2018-03-19 ---
I think I found a better way to implement this functionality. I ended up creating a middleware class so that it is reusable and the codebase stays clean.
Solution
class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):  # note: the method Scrapy calls is start_requests, not start_request
        # read urls from an external file
        urls = [...]
        for url in urls:
            # the response could be a user validation page
            yield scrapy.Request(url)

    def parse(self, response):
        # check if it's the user validation page;
        # here I assume you know how to judge whether it is one
        if is_validation_page(response):
            # Selenium goes here
            browser = webdriver.PhantomJS()
            ...
            # send the request again; dont_filter=True bypasses Scrapy's
            # duplicate filter, which would otherwise drop the repeated URL
            yield scrapy.Request(browser.current_url, dont_filter=True)
        else:
            # not the validation page: parse the data
            yield item
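Since the update mentions making this reusable as a middleware, here is a minimal downloader-middleware sketch of the same idea. The helper `is_validation_page` and the slider selector `#slider` are assumptions you must adapt to the real site, and Chrome is used instead of the now-deprecated PhantomJS; this is a sketch, not a drop-in implementation.

```python
def is_validation_page(response):
    """Placeholder heuristic -- replace with the site-specific check,
    e.g. a redirect to a known validation URL."""
    return "validate" in response.url


class ValidationMiddleware:
    """Downloader middleware: if a response is the validation page,
    unlock the account with Selenium and re-issue the original request."""

    def process_response(self, request, response, spider):
        if not is_validation_page(response):
            # normal page: hand the response to the spider unchanged
            return response

        # Imported lazily so the module loads without Selenium installed.
        from selenium import webdriver
        from selenium.webdriver.common.action_chains import ActionChains

        browser = webdriver.Chrome()  # PhantomJS is deprecated; Chrome assumed here
        try:
            browser.get(request.url)
            # "#slider" is an assumed selector; the drag offset is a guess too
            slider = browser.find_element("css selector", "#slider")
            ActionChains(browser).click_and_hold(slider) \
                .move_by_offset(300, 0).release().perform()
        finally:
            browser.quit()

        # Retry the original request; dont_filter=True bypasses
        # Scrapy's duplicate filter, which would otherwise drop it.
        return request.replace(dont_filter=True)
```

Enable it via `DOWNLOADER_MIDDLEWARES` in `settings.py`; because the retry happens in `process_response`, the spider's `parse` never sees the validation page and needs no special-case branch.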
Answered By - just_be_happy