Issue
I wrote a simple spider which I want to follow all links within a domain (in this example amazon.com). This is my code so far:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urllib.parse import urlparse
from scrapy.utils.response import open_in_browser


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['amazon.com']

    rules = (
        Rule(
            LinkExtractor(
                allow='',
                deny_extensions=['7z', '7zip', 'apk', 'bz2', 'cdr', 'dmg',
                                 'ico', 'iso', 'tar', 'tar.gz', 'pdf', 'docx'],
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    custom_settings = {'LOG_ENABLED': True}

    def start_requests(self):
        #print(self.website)
        url = 'https://www.amazon.com/s?k=water+balloons'
        yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        #open_in_browser(response)
        print(response.url)
I checked this question, but the answer didn't work: scrapy follow all the links and get status. I also tried replacing allow='' with restrict_xpaths='//a', but that didn't solve it either. Any help is appreciated.
Note: It is important that the spider stays within the "amazon.com" domain.
Solution
You have specified the rules correctly, but the problem with your code is that you are not calling the proper method inside your start_requests method. For the rules to trigger, you need to send the first request to the built-in parse method, which is where CrawlSpider matches links against the rules and schedules follow-up requests.
Something like this:
def start_requests(self):
    #print(self.website)
    url = 'https://www.amazon.com/s?k=water+balloons'
    yield scrapy.Request(url, callback=self.parse)
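For completeness, here is a minimal corrected version of the whole spider, as a sketch assuming a recent Scrapy release. Since CrawlSpider's default start_requests already routes the initial responses through the rules, you can also drop your own start_requests entirely and let start_urls drive the crawl:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    # OffsiteMiddleware drops any extracted link outside this domain,
    # so the spider stays within amazon.com
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/s?k=water+balloons']

    rules = (
        Rule(
            LinkExtractor(
                deny_extensions=['7z', '7zip', 'apk', 'bz2', 'cdr', 'dmg',
                                 'ico', 'iso', 'tar', 'tar.gz', 'pdf', 'docx'],
            ),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Every followed link that matches the rule lands here
        print(response.url)

Either way, the key point is the same: the response to the initial request has to reach CrawlSpider's rule-processing callback; a custom callback like parse_item bypasses the rules, so no links are followed.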
Answered By - asimhashmi