Issue
I am building a project where I need a web crawler that crawls a list of different webpages. This list can change at any time. What is the best way to implement this with Scrapy? Should I create one spider for all websites or dynamically create spiders?
I have read about scrapyd, and I guess that dynamically creating spiders is the best approach. I would need a hint about how to implement it though.
Solution
If the parsing logic is the same for all sites, there are two approaches:
- For a large number of webpages, keep the list in an external source, read it at startup, for example in the start_requests method or in the constructor, and assign it to start_urls (see the sketch after this list).
- You can pass a webpage link as a parameter to the spider from the command line; likewise, in start_requests or in the constructor you can access this parameter and assign it to start_urls.
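A minimal sketch combining both approaches in a single spider with shared parsing logic; the spider name, the urls.txt filename, and the start_url argument are illustrative assumptions, not requirements:

import scrapy


class WebsiteSpider(scrapy.Spider):
    name = "spider_name"  # hypothetical spider name

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Approach 2: a single URL passed from the command line (-a start_url=...)
        self.start_url = start_url

    def start_requests(self):
        if self.start_url:
            urls = [self.start_url]
        else:
            # Approach 1: read the changeable list of webpages from an external file
            with open("urls.txt") as f:
                urls = [line.strip() for line in f if line.strip()]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Shared parsing logic for every site goes here
        yield {"url": response.url, "title": response.css("title::text").get()}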
Passing parameters in Scrapy
scrapy crawl spider_name -a start_url=your_url
When scheduling the spider through scrapyd's schedule.json endpoint, replace -a with -d, since spider arguments are passed as POST data.
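For example, assuming the project has been deployed to scrapyd under the (hypothetical) project name myproject, a run can be scheduled with:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider_name -d start_url=your_url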
Answered By - Tasawer Nawaz