Issue
I have a problem with a Scrapy Python program I'm trying to build. The code is the following.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkscrawlItem(scrapy.Item):
    link = scrapy.Field()
    attr = scrapy.Field()


class someSpider(CrawlSpider):
    name = 'mysitecrawler'
    item = []

    allowed_domains = ['mysite.co.uk']
    start_urls = ['https://mysite.co.uk/']
    rules = (
        Rule(LinkExtractor(), callback="parse_obj", follow=True),
        Rule(LinkExtractor(deny=('my-account', 'cart', 'checkout', 'wp-content')))
    )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url)+":"+str(response.status)
        filename = 'links2.txt'
        with open(filename, 'a') as f:
            f.write('\n'+str(response.url)+":"+str(response.status)+'\n')
        self.log('Saved file %s' % filename)
I'm having trouble with the LinkExtractor: as I understand it, deny should exclude the listed paths from the crawl, but the spider is still crawling them. For example, it still visits these URLs:
https://mysite.co.uk/my-account/
https://mysite.co.uk/checkout/
It also still crawls URLs containing wp-content, for example:
https://mysite.co.uk/wp-content/uploads/01/22/photo.jpg
Would anyone know what I'm doing wrong with my deny list please?
Thank you
Solution
There are two issues with your code. First, your crawl spider has two Rules, and the deny restriction sits on the second one, which is effectively never applied: CrawlSpider gives each extracted link to the first rule that matches it, and your first Rule extracts and follows every link and calls the callback, so nothing excludes the URLs you don't want to crawl. Second, in the second rule you passed the literal strings of the paths you want to avoid, but deny expects regular expressions.
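To see what those patterns are matched against, you can test them directly with Python's re module. This is just a quick sanity check, not part of the spider itself; it mirrors how LinkExtractor applies deny patterns, which is by searching for each pattern anywhere in the URL:

import re

# deny patterns written as raw-string regular expressions
deny_patterns = [r'my\-account', r'cart', r'checkout', r'wp\-content']

url = 'https://mysite.co.uk/my-account/'

# A URL is rejected if any deny pattern is found anywhere in it.
print(any(re.search(p, url) for p in deny_patterns))  # True, so this URL would be excluded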
The solution is to remove the first rule and slightly adjust the deny argument, escaping special regex characters in the URLs such as -. See the sample below.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkscrawlItem(scrapy.Item):
    link = scrapy.Field()
    attr = scrapy.Field()


class SomeSpider(CrawlSpider):
    name = 'mysitecrawler'
    allowed_domains = ['mysite.co.uk']
    start_urls = ['https://mysite.co.uk/']

    # A single rule: deny the unwanted paths (raw-string regexes) and
    # follow everything else, calling parse_obj for each crawled page.
    rules = (
        Rule(LinkExtractor(deny=(r'my\-account', r'cart', r'checkout', r'wp\-content')),
             callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url) + ":" + str(response.status)
        # Append the crawled URL and its status code to a text file.
        filename = 'links2.txt'
        with open(filename, 'a') as f:
            f.write('\n' + str(response.url) + ":" + str(response.status) + '\n')
        self.log('Saved file %s' % filename)
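If you want to try the spider quickly without setting up a full Scrapy project, one option is to run it with Scrapy's CrawlerProcess. This is a minimal sketch that assumes the corrected spider class above is defined in the same file:

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # Start a crawler process with default settings and run SomeSpider once.
    process = CrawlerProcess()
    process.crawl(SomeSpider)
    process.start()  # blocks until the crawl finishes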
Answered By - msenior_