Issue
I have a JSON API that I want to scrape: https://www.website.com/api/list?limit=50&page=1
I can crawl all the pages with 'scrapy.Spider' no problem, but if it's possible I'd prefer to do it with 'CrawlSpider'.
I tried to use:

    start_urls = ['https://www.website.com']
    rules = (
        Rule(LinkExtractor(allow=r'/api/list\?.+page=\d+'), callback='parse_page', follow=True),
    )
and (just to see if it's even matching the first page):

    start_urls = ['https://www.website.com']
    rules = (
        Rule(LinkExtractor(allow=r'/api/list'), callback='parse_page', follow=True),
    )
but neither of them worked.
Is there a way to do it with 'CrawlSpider'?
Solution
It is not possible with CrawlSpider.
The LinkExtractor that processes CrawlSpider rules can only extract links from HTML responses (not from JSON API responses), and by default only from <a> and <area> tags. Since your API returns JSON, the extractor finds no links to follow, so the rules never fire.
Answered By - Georgiy