Issue
I'd like to scrape news pages / blogs (anything that contains new information on a daily basis).
My crawler works fine and does everything I ask it to do.
But I cannot find a proper way to make it ignore already scraped URLs (or, more generally, already scraped items) and only add new URLs/items to an existing JSON/CSV file.
I've seen many solutions here for checking whether an item already exists in a CSV file, but none of them really worked.
Scrapy DeltaFetch apparently cannot be installed on my system: I keep getting errors, and none of the usual hints, e.g. $ sudo pip install bsddb3, upgrade this, update that, etc., do the trick. (I've tried for 3 hours now and am fed up with hunting for fixes for a package that hasn't been updated since 2017.)
I hope you have a handy and practical solution.
Thank you very much in advance!
Best regards!
Solution
An option could be a custom downloader middleware with the following:
- A process_response method that stores the URL you crawled in a database
- A process_request method that checks whether the URL is already present in the database. If it is, you raise an IgnoreRequest so the request does not go through anymore (see the sketch below).
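Below is a minimal sketch of such a middleware, using SQLite as the database for simplicity. The class name SeenUrlsMiddleware, the file name seen_urls.db, and the SEEN_URLS_DB setting are illustrative assumptions, not part of the original answer.

```python
# Minimal sketch of a downloader middleware that skips already-crawled URLs.
# SQLite is used as the "database" here purely for illustration.
import sqlite3

from scrapy.exceptions import IgnoreRequest


class SeenUrlsMiddleware:
    """Drop requests whose URL was already fetched in a previous run."""

    def __init__(self, db_path="seen_urls.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    @classmethod
    def from_crawler(cls, crawler):
        # SEEN_URLS_DB is a hypothetical setting name for this sketch.
        return cls(crawler.settings.get("SEEN_URLS_DB", "seen_urls.db"))

    def process_request(self, request, spider):
        # If the URL is already stored, raise IgnoreRequest so it is not fetched again.
        cursor = self.conn.execute(
            "SELECT 1 FROM seen_urls WHERE url = ?", (request.url,)
        )
        if cursor.fetchone():
            raise IgnoreRequest(f"Already crawled: {request.url}")
        return None  # let the request continue through the middleware chain

    def process_response(self, request, response, spider):
        # Record successfully fetched URLs so later runs skip them.
        if response.status == 200:
            self.conn.execute(
                "INSERT OR IGNORE INTO seen_urls (url) VALUES (?)",
                (request.url,),
            )
            self.conn.commit()
        return response
```

To use it, you would enable the middleware in your project's settings.py (the module path shown is an assumption about your project layout):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeenUrlsMiddleware": 543,
}
```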
Answered By - Wim Hermans