Issue
I have the following link pattern: https://www.somewebsite.com/api-internal/v1/festivals/1/ where the /1/ at the end is the pk. My goal is to scrape every pk of that API weekly. Currently there are 100 entries; next week there might be 110.
Is there a way to scrape the first 100 this week and only the 10 "new" entries next week, without re-scraping the 100 entries I already did? An example response looks like this:
HTTP 200 OK
Allow: GET, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
{
    "pk": 1,
    "name": "Stop Making Sense",
    "theme": "purple",
    "slug": "stop-making-sense",
    "series": {
        "pk": 9,
        "name": "Stop Making Sense",
        "slug": "stop-making-sense"
    },
    "edition": "2012",
    "is_active": false,
    "featured": false,
    "listed": false,
    "start": "2012-08-02",
    "end": "2012-08-06",
    "date_unconfirmed": false,
    "url": "https://www.somewebsite.com/entries/stop-making-sense/2012/"
}
Solution
One option to avoid re-downloading previously scraped pages is Scrapy's HttpCacheMiddleware.
HttpCacheMiddleware stores each request and response in a local folder, so on the next scraper run only responses for new URLs are downloaded; for previously scraped pages, the responses are served from the local 'httpcache' directory.
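As a minimal sketch, the cache can also be enabled project-wide in settings.py using Scrapy's standard httpcache settings (the values shown below are the defaults, apart from enabling the cache itself):

# settings.py -- enable Scrapy's HTTP cache for the whole project
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"     # cache folder, created under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire

Since cached responses never expire by default, entries fetched this week remain cached for next week's run.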
To skip previously scraped results completely, we can add a conditional statement that checks the response for the download_latency meta key (responses served from the httpcache don't have it). As a result, the spider will yield items from new URLs only.
Since you can't know how many ids the website will have in the future, one option is to start from id 1 and continuously increase it inside the parse method, instead of populating the start_urls list with something like [f".../{str(id)}" for id in range(1, 100)].
If at some point (for example, while processing id 110) the application returns a 404, the request for the URL with id 111 will not be scheduled, because by default Scrapy does not call the parse callback for non-200 responses.
The result will look something like this:
import scrapy
import json


class SomeApiSpider(scrapy.Spider):
    name = "api"
    custom_settings = {"HTTPCACHE_ENABLED": True}
    start_urls = ["https://www.somewebsite.com/api-internal/v1/festivals/1/"]

    def parse(self, response):
        # derive the next id from the current url
        next_id = str(int(response.url.split("/")[-2]) + 1)
        # responses served from the httpcache have no "download_latency" meta key,
        # so this branch only runs for freshly downloaded (new) pages
        if "download_latency" in response.meta.keys():
            data = json.loads(response.body)
            ...  # do your stuff
        yield scrapy.Request(
            url=f"https://www.somewebsite.com/api-internal/v1/festivals/{next_id}/",
            callback=self.parse,
        )
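On the first run the spider walks ids 1, 2, 3, ... and caches every response; on next week's run the first 100 ids are answered from the local httpcache (no network traffic, no items yielded), and only the new ids are downloaded and parsed. Assuming the spider lives in a Scrapy project, run it as usual with scrapy crawl api.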
Answered By - Georgiy