Issue
I have a spider that needs to find product prices. Those products are grouped together in batches (coming from a database), and it would be nice to have a batch status (RUNNING, DONE) along with started_on and finished_on attributes.
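For reference, a minimal sketch of the kind of Batches model involved (Django-style; only the fields and the get_products() helper used by the spider are shown, the Product relation and everything else is illustrative):

from django.db import models


class Batches(models.Model):
    # Hypothetical sketch of the batch model queried by the spider below
    status = models.CharField(max_length=10, default='PENDING')  # PENDING / RUNNING / DONE
    started_on = models.DateTimeField(null=True, blank=True)
    finished_on = models.DateTimeField(null=True, blank=True)

    def get_products(self):
        # assumed reverse relation from a Product model with a ForeignKey to Batches
        return list(self.product_set.all())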
So I have something like:
import scrapy
from datetime import datetime


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: this is going to execute before
                          #     the last product url is scraped, right?

    def parse(self, response):
        # ...
The problem here is that, due to the async nature of Scrapy, the second status update on the batch object is going to run too soon... right? Is there a way to group these requests together somehow and have the batch object be updated when the last one is parsed?
Solution
I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:
import scrapy
from datetime import datetime


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            # the counter dictionary for this batch
            counter = {'curr': 0, 'total': len(products)}
            for prod in products:
                # trick = add the counter in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        # increment the counter only after the work is done
        self.increment_counter(batch, counter)

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
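The reason a plain dict works as the counter is that every request of a batch carries a reference to the same dict object in its meta, not a copy, so an increment made while handling one response is visible when the next response is handled. A tiny plain-Python illustration of that shared-reference behaviour:

counter = {'curr': 0, 'total': 3}
same = counter                # meta={'counter': counter} passes around this same object
same['curr'] += 1
assert counter['curr'] == 1   # the change is visible through the original reference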
Well, almost...
This works fine as long as all the Requests yielded by start_requests have different URLs. If there are any duplicates, Scrapy will filter them out and never call your parse method, so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.
As it turns out, you can override Scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, which lets the spider know when a duplicate is found:
from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
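Note that DUPEFILTER_CLASS in settings.py applies to every spider in the project, so this filter will also run for spiders that never define look_a_dupe. If that is a concern, a getattr guard keeps those spiders working with the same filter (a small variation on the class above, nothing more):

from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        # only notify spiders that actually implement the hook
        hook = getattr(spider, 'look_a_dupe', None)
        if hook is not None:
            hook(request)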
Then we modify our spider to make it increment our counter when a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go.
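For completeness, another way to handle the duplicates issue, instead of a custom dupefilter, would be to opt those requests out of duplicate filtering with Request's dont_filter flag, at the cost of actually re-downloading the duplicated URLs:

yield scrapy.Request(prod.get_scrape_url(),
                     meta={'prod': prod,
                           'batch': batch,
                           'counter': counter},
                     dont_filter=True)  # bypass the duplicate filter for this request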
Answered By - Tony Lâmpada