Issue
I have the following link pattern: https://www.somewebsite.com/api-internal/v1/festivals/1/ where the /1/ at the end is the pk. My goal is to scrape every pk of that API weekly. Currently there are 100 entries; next week there might be 110.
Is there a way to scrape the first 100 this week and only the 10 "new" entries next week, without re-scraping the 100 entries I already did? An example response looks like this:
HTTP 200 OK
Allow: GET, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
{
    "pk": 1,
    "name": "Stop Making Sense",
    "theme": "purple",
    "slug": "stop-making-sense",
    "series": {
        "pk": 9,
        "name": "Stop Making Sense",
        "slug": "stop-making-sense"
    },
    "edition": "2012",
    "is_active": false,
    "featured": false,
    "listed": false,
    "start": "2012-08-02",
    "end": "2012-08-06",
    "date_unconfirmed": false,
    "url": "https://www.somewebsite.com/entries/stop-making-sense/2012/"
}
Solution
One option to avoid re-downloading previously scraped pages is Scrapy's HttpCacheMiddleware.
HttpCacheMiddleware stores each request and response in a local folder, so on the next scraper run only responses for new URLs are downloaded; for previously scraped pages, the responses are served from the local 'httpcache' directory.
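As a minimal sketch, the cache can also be enabled project-wide in settings.py using Scrapy's standard httpcache settings (the values shown below are the defaults, apart from enabling the cache itself):

# settings.py -- enable Scrapy's HTTP cache for the whole project
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"     # cache folder, created under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire

Since cached responses never expire by default, entries fetched this week remain cached for next week's run.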
To skip previously scraped results completely, we can add a conditional statement that checks the response for the download_latency meta key (responses served from the httpcache don't have it). As a result, the spider will yield items from new URLs only.
Since you can't know how many ids the website will have in the future, one option is to start from id 1 and continuously increase it inside the parse method, instead of populating the start_urls list with something like [f".../{str(id)}" for id in range(1, 100)].
If at some point (for example, while processing id 110) the application returns a 404, the request for the URL with id 111 will not be scheduled, because by default Scrapy does not call the parse callback for non-200 responses.
The result will look something like this:
import scrapy
import json


class SomeApiSpider(scrapy.Spider):
    name = "api"
    custom_settings = {"HTTPCACHE_ENABLED": True}
    start_urls = ["https://www.somewebsite.com/api-internal/v1/festivals/1/"]

    def parse(self, response):
        # derive the next id from the current url
        next_id = str(int(response.url.split("/")[-2]) + 1)
        # responses served from the httpcache have no "download_latency" meta key,
        # so this branch only runs for freshly downloaded (new) pages
        if "download_latency" in response.meta.keys():
            data = json.loads(response.body)
            ...  # do your stuff
        yield scrapy.Request(
            url=f"https://www.somewebsite.com/api-internal/v1/festivals/{next_id}/",
            callback=self.parse,
        )
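On the first run the spider walks ids 1, 2, 3, ... and caches every response; on next week's run the first 100 ids are answered from the local httpcache (no network traffic, no items yielded), and only the new ids are downloaded and parsed. Assuming the spider lives in a Scrapy project, run it as usual with scrapy crawl api.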
Answered By - Georgiy