Issue
I can't figure out why my spider only crawls the start_urls page and never extracts any URLs that match the allow parameter of my rule.
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.settings import Settings
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com/"]
    rules = [Rule(LinkExtractor(allow='/product_page/'), callback='parse', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)

if __name__ == '__main__':
    settings = Settings()
    settings.set('ITEM_PIPELINES', {
        'pipelines.csv_pipeline.CsvPipeline': 100
    })
    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    process.start()
I'm not sure whether the issue is caused by running the spider from the __main__ block.
Solution
The problem is probably that you're redefining the parse method, which should be avoided. From the crawling rules docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
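To see why, you can print CrawlSpider's own parse method; it is the dispatcher that feeds every response through the rule machinery, so replacing it silently disables the rules. A quick check, assuming Scrapy is installed:

import inspect
from scrapy.spiders import CrawlSpider

# CrawlSpider defines parse() itself and uses it to route responses
# into its internal rule processing; printing its source makes the
# collision with a user-defined parse() obvious.
print(inspect.getsource(CrawlSpider.parse))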
So I'd try naming the function something else (I renamed it to parse_item, similar to the CrawlSpider example from the docs, but you can use any name):
class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]
    rules = [Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True),
             # no callback for the list pages: they are only followed for
             # links (callback='parse' would reintroduce the same problem)
             Rule(LinkExtractor(allow='/list_of_products.+'), follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse_item(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)
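Two side notes. First, allowed_domains should contain bare domain names; Scrapy expects domains there, not URLs, so the trailing slash in the question's "website.com/" is invalid, which is why it's dropped in the code above. Second, if you also want a callback for the listing pages themselves, CrawlSpider provides the parse_start_url hook for exactly that purpose; don't route them through parse.

To quickly verify that the extractor pattern matches anything at all, you can test it in scrapy shell. A minimal sketch, assuming the URL and pattern from the question:

from scrapy.linkextractors import LinkExtractor

# Run `scrapy shell http://www.website.com/list_of_products.php` first;
# inside the shell, `response` is predefined. extract_links() returns
# the Link objects a Rule built from this extractor would follow.
le = LinkExtractor(allow='/product_page/.+')
print(le.extract_links(response))

If that prints an empty list, the allow pattern (or the allowed_domains value) is the culprit rather than the callback.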
Answered By - Ismael Padilla