Issue
I have a problem with a Scrapy Python program I'm trying to build. The code is the following.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkscrawlItem(scrapy.Item):
    link = scrapy.Field()
    attr = scrapy.Field()


class someSpider(CrawlSpider):
    name = 'mysitecrawler'
    item = []

    allowed_domains = ['mysite.co.uk']
    start_urls = ['https://mysite.co.uk/']
    rules = (
        Rule(LinkExtractor(), callback="parse_obj", follow=True),
        Rule(LinkExtractor(deny=('my-account', 'cart', 'checkout', 'wp-content')))
    )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url)+":"+str(response.status)
        filename = 'links2.txt'
        with open(filename, 'a') as f:
            f.write('\n'+str(response.url)+":"+str(response.status)+'\n')
        self.log('Saved file %s' % filename)
I'm having trouble with the LinkExtractor: as I understand it, deny should exclude the listed paths from the crawl, but the spider is still crawling them. For example, it still visits these URLs:
https://mysite.co.uk/my-account/
https://mysite.co.uk/checkout/
It also still crawls URLs containing wp-content, for example:
https://mysite.co.uk/wp-content/uploads/01/22/photo.jpg
Would anyone know what I'm doing wrong with my deny list please?
Thank you
Solution
There are two issues with your code. First, your crawl spider has two Rules, and the deny restriction sits on the second one, which is effectively never applied: CrawlSpider gives each extracted link to the first rule that matches it, and your first Rule extracts and follows every link and calls the callback, so nothing excludes the URLs you don't want to crawl. Second, in the second rule you passed the literal strings of the paths you want to avoid, but deny expects regular expressions.
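To see what those patterns are matched against, you can test them directly with Python's re module. This is just a quick sanity check, not part of the spider itself; it mirrors how LinkExtractor applies deny patterns, which is by searching for each pattern anywhere in the URL:

import re

# deny patterns written as raw-string regular expressions
deny_patterns = [r'my\-account', r'cart', r'checkout', r'wp\-content']

url = 'https://mysite.co.uk/my-account/'

# A URL is rejected if any deny pattern is found anywhere in it.
print(any(re.search(p, url) for p in deny_patterns))  # True, so this URL would be excluded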
The solution is to remove the first rule and slightly adjust the deny argument, escaping special regex characters in the URLs such as -. See the sample below.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LinkscrawlItem(scrapy.Item):
    link = scrapy.Field()
    attr = scrapy.Field()


class SomeSpider(CrawlSpider):
    name = 'mysitecrawler'
    allowed_domains = ['mysite.co.uk']
    start_urls = ['https://mysite.co.uk/']

    # A single rule: deny the unwanted paths (raw-string regexes) and
    # follow everything else, calling parse_obj for each crawled page.
    rules = (
        Rule(LinkExtractor(deny=(r'my\-account', r'cart', r'checkout', r'wp\-content')),
             callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url) + ":" + str(response.status)
        # Append the crawled URL and its status code to a text file.
        filename = 'links2.txt'
        with open(filename, 'a') as f:
            f.write('\n' + str(response.url) + ":" + str(response.status) + '\n')
        self.log('Saved file %s' % filename)
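If you want to try the spider quickly without setting up a full Scrapy project, one option is to run it with Scrapy's CrawlerProcess. This is a minimal sketch that assumes the corrected spider class above is defined in the same file:

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # Start a crawler process with default settings and run SomeSpider once.
    process = CrawlerProcess()
    process.crawl(SomeSpider)
    process.start()  # blocks until the crawl finishes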
Answered By - msenior_