Issue
I have a spider that needs to find product prices. Those products are grouped together in batches (coming from a database), and it would be nice to have a batch status (RUNNING, DONE) along with started_on and finished_on attributes.
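For reference, a minimal sketch of the kind of Batches model involved (Django-style; only the fields and the get_products() helper used by the spider are shown, the Product relation and everything else is illustrative):

from django.db import models


class Batches(models.Model):
    # Hypothetical sketch of the batch model queried by the spider below
    status = models.CharField(max_length=10, default='PENDING')  # PENDING / RUNNING / DONE
    started_on = models.DateTimeField(null=True, blank=True)
    finished_on = models.DateTimeField(null=True, blank=True)

    def get_products(self):
        # assumed reverse relation from a Product model with a ForeignKey to Batches
        return list(self.product_set.all())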
So I have something like:
import scrapy
from datetime import datetime


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: this is going to execute before
                          #     the last product url is scraped, right?

    def parse(self, response):
        # ...
The problem here is that, due to the async nature of Scrapy, the second status update on the batch object is going to run too soon... right? Is there a way to group these requests together somehow and have the batch object be updated when the last one is parsed?
Solution
I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:
import scrapy
from datetime import datetime


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            # the counter dictionary for this batch
            counter = {'curr': 0, 'total': len(products)}
            for prod in products:
                # trick = add the counter in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        # increment the counter only after the work is done
        self.increment_counter(batch, counter)

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
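The reason a plain dict works as the counter is that every request of a batch carries a reference to the same dict object in its meta, not a copy, so an increment made while handling one response is visible when the next response is handled. A tiny plain-Python illustration of that shared-reference behaviour:

counter = {'curr': 0, 'total': 3}
same = counter                # meta={'counter': counter} passes around this same object
same['curr'] += 1
assert counter['curr'] == 1   # the change is visible through the original reference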
Well, almost...
This works fine as long as all the Requests yielded by start_requests have different URLs. If there are any duplicates, Scrapy will filter them out and never call your parse method, so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.
As it turns out, you can override Scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, which lets the spider know when a duplicate is found:
from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
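Note that DUPEFILTER_CLASS in settings.py applies to every spider in the project, so this filter will also run for spiders that never define look_a_dupe. If that is a concern, a getattr guard keeps those spiders working with the same filter (a small variation on the class above, nothing more):

from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        # only notify spiders that actually implement the hook
        hook = getattr(spider, 'look_a_dupe', None)
        if hook is not None:
            hook(request)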
Then we modify our spider to make it increment our counter when a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go.
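For completeness, another way to handle the duplicates issue, instead of a custom dupefilter, would be to opt those requests out of duplicate filtering with Request's dont_filter flag, at the cost of actually re-downloading the duplicated URLs:

yield scrapy.Request(prod.get_scrape_url(),
                     meta={'prod': prod,
                           'batch': batch,
                           'counter': counter},
                     dont_filter=True)  # bypass the duplicate filter for this request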
Answered By - Tony Lâmpada