Issue
I have been working on crawling a site and am stuck. I need to begin at a URL and crawl every URL that stems from it, then every URL that stems from those, and so on, as deep as possible. I've searched previous posts and still cannot find a solution to my problem.
When I run my code, it scrapes the links from my starting URL, but it never follows those links to crawl the pages they point to. I have been trying to solve this for some time and cannot.
I believe my issue may be with the rule I'm setting, but I'm not sure. I've removed allowed_domains and made sure follow=True is included in my code. I know that passing callback='parse' would be a problem, but callback='parse_item' appears not to be, and removing the callback entirely makes no difference either.
I have not made changes to the pipelines or settings files, as I'm not sure what I'd do to them. My item is pretty standard: it has a single field, and to load it I use from hospitalone.items import HospitaloneItem.
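The item looks roughly like this (a standard single-field Item):

from scrapy.item import Item, Field

class HospitaloneItem(Item):
    name = Field()  # the single field, populated as item['name'] in the spider

The rest of my code is below: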
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from hospitalone.items import HospitaloneItem

class HopitaloneSpider(CrawlSpider):
    name = 'example.org'
    #allowed_domains = ['http://example.org/']
    start_urls = [
        'http://example.org/'
    ]
    rules = (
        #Rule(SgmlLinkExtractor(allow='example\.org', )),
        Rule(SgmlLinkExtractor(allow=('\w+$', )), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = HospitaloneItem()
        item['name'] = hxs.select('//a').extract()
        return item
Any help would be great. Thanks so much.
Solution
I think the problem is that, since you are using CrawlSpider, you must not override the parse method: CrawlSpider implements parse itself and uses it to apply the crawling rules, so overriding it means the rules never run and no links are followed. Here is a link that might help you: http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
Follow the example there and implement your callback as a method with a different name.
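As a minimal sketch, keeping everything from the question unchanged except the method name, renaming parse to parse_item (the name the rule's callback already points to) lets CrawlSpider's own parse drive the rules:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from hospitalone.items import HospitaloneItem

class HopitaloneSpider(CrawlSpider):
    name = 'example.org'
    start_urls = [
        'http://example.org/'
    ]
    rules = (
        # follow every matching link; each matched page is handed to parse_item
        Rule(SgmlLinkExtractor(allow=('\w+$', )), callback='parse_item', follow=True),
    )

    # renamed from parse: CrawlSpider's built-in parse must stay intact
    # so it can apply the rules and schedule the follow-up requests
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = HospitaloneItem()
        item['name'] = hxs.select('//a').extract()
        return item

With this change, every response first goes through CrawlSpider's internal parse, which extracts links via the rule, follows them, and calls parse_item on each matching page.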
Answered By - minus