Issue
I have a functioning Scrapy CrawlSpider that scrapes information from pages such as https://www.parkrun.org.uk/marple/results/weeklyresults/?runSeqNumber=100. I have already saved the collected data in CSV format.
These events are repeated on a weekly basis and I'd like to collect new information only, rather than crawling all previous events. While this will speed up the process for me, my main motivation is to avoid making unnecessary requests to the website.
I have experimented with DeltaFetch, but it seems to introduce errors into the data I scrape, with lots of duplicated individual runner times and other strange results.
My preference is to use middlewares to check against a list of previously scraped event URLs (stored in a CSV file or similar) and to prevent requests being made to those URLs, even if they satisfy the rules defined in my crawlspider.py.
I'm not sure how best to implement this, or which part of the middlewares.py file to use so that the requests are never made, rather than simply not downloading data after the page has already been visited.
Any help you can offer would be really appreciated.
Solution
Create a custom spider middleware (see the Scrapy docs: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#writing-your-own-spider-middleware).

Load the event URLs you have already scraped (your CSV file) into a set, and add each new URL to that set as you yield requests. In the middleware, check every request the spider yields against the set and discard it if the URL is already there. The right hook for this is process_spider_output, which sees every request and item the spider produces before the requests reach the scheduler; process_spider_input only runs on responses that have already been downloaded, so it cannot stop the request from being made.

Don't forget to enable the middleware in settings.py.
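A minimal sketch of such a spider middleware, assuming the previously scraped URLs are stored one per row in a CSV file. The class name SkipScrapedUrlsMiddleware, the setting SEEN_URLS_FILE, the file name seen_urls.csv and the module path myproject.middlewares are all illustrative assumptions, not names from the original project:

```python
# middlewares.py (sketch) -- drop requests whose URLs were scraped previously.
import csv
import os

from scrapy import Request


class SkipScrapedUrlsMiddleware:
    """Spider middleware that filters out requests to URLs listed in a CSV file."""

    def __init__(self, seen_urls_file):
        self.seen_urls = set()
        # Seed the set from the CSV of previously scraped URLs, if it exists.
        if os.path.exists(seen_urls_file):
            with open(seen_urls_file, newline="") as f:
                for row in csv.reader(f):
                    if row:
                        self.seen_urls.add(row[0])

    @classmethod
    def from_crawler(cls, crawler):
        # Read the CSV path from the project settings (hypothetical setting name).
        return cls(crawler.settings.get("SEEN_URLS_FILE", "seen_urls.csv"))

    def process_spider_output(self, response, result, spider):
        # Called with every item/request the spider yields, including requests
        # generated by CrawlSpider rules. Drop requests whose URL is already in
        # the set; pass everything else through unchanged.
        for element in result:
            if isinstance(element, Request) and element.url in self.seen_urls:
                spider.logger.info("Skipping already-scraped URL: %s", element.url)
                continue
            yield element
```

Then enable it in settings.py (the priority value 543 is arbitrary):

```python
# settings.py
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.SkipScrapedUrlsMiddleware": 543,
}
SEEN_URLS_FILE = "seen_urls.csv"
```

Because the filtering happens on the spider's output, the dropped requests never reach the scheduler or the downloader, which is what avoids the unnecessary requests to the site. You still need to append each event URL to the CSV (for example in your item pipeline or in the spider's callback) once it has been scraped, so the set is up to date on the next run.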
Answered By - liu alex