Issue
I'd like to crawl a certain URL that returns a random response each time it's called. The code below returns what I want, but I'd like to run it for a long time so that I can use the data for an NLP application. The code only runs once with scrapy crawl the, though I expect it to keep running because of the last if statement.
Is Unix's start command what I'm looking for? I tried it, but it felt a bit slow. If I had to use the start command, would opening many terminal tabs and running the same command with the start prefix be good practice, or would it just throttle the speed?
import scrapy

class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['https://websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        info = {}
        info['text'] = response.css('.pd-text').extract()
        yield info

        next_page = 'https://websiteiwannacrawl.com'
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
Solution
dont_filter
indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
You should add this to your Request:
yield scrapy.Request(next_page, dont_filter=True)
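For context, here is a minimal sketch of your spider with the fix applied (it reuses the URL and the .pd-text selector from your snippet; note that allowed_domains expects bare domain names, not full URLs):

import scrapy

class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['websiteiwannacrawl.com']  # bare domain, not a URL
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        # extract the random text served on this visit
        yield {'text': response.css('.pd-text').extract()}
        # dont_filter=True tells the scheduler to skip the duplicates
        # filter, so the same URL is re-requested indefinitely
        yield scrapy.Request(response.url, dont_filter=True)

If you want to bound such a long-running crawl instead of letting it loop forever, the CloseSpider extension offers settings such as CLOSESPIDER_ITEMCOUNT and CLOSESPIDER_TIMEOUT.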
As a side note, unrelated to your question: regarding callback=self.parse, please read about Scrapy's Parse Method. parse() is the default callback when a Request does not specify one, so you can omit it here.
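For illustration, inside parse() these two requests behave identically:

    yield scrapy.Request(next_page, dont_filter=True)
    # equivalent, with the default callback written out explicitly
    yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)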
Answered By - parik