Issue
I can't figure out why my spider only crawls the start_urls page and never extracts any URLs that match the allow parameter of my rule.
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.settings import Settings
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com/"]
    rules = [Rule(LinkExtractor(allow='/product_page/'), callback='parse', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)

if __name__ == '__main__':
    settings = Settings()
    settings.set('ITEM_PIPELINES', {
        'pipelines.csv_pipeline.CsvPipeline': 100
    })
    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    process.start()
I'm not sure whether the issue is caused by running the spider from the __main__ block.
Solution
The problem is probably that you're redefining the parse method, which should be avoided. From the crawling rules docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
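To see why, you can print CrawlSpider's own parse method; it is the dispatcher that feeds every response through the rule machinery, so replacing it silently disables the rules. A quick check, assuming Scrapy is installed:

import inspect
from scrapy.spiders import CrawlSpider

# CrawlSpider defines parse() itself and uses it to route responses
# into its internal rule processing; printing its source makes the
# collision with a user-defined parse() obvious.
print(inspect.getsource(CrawlSpider.parse))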
So I'd try naming the function something else (I renamed it to parse_item, similar to the CrawlSpider example from the docs, but you can use any name):
class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]
    rules = [Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True),
             # no callback for the list pages: they are only followed for
             # links (callback='parse' would reintroduce the same problem)
             Rule(LinkExtractor(allow='/list_of_products.+'), follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse_item(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)
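Two side notes. First, allowed_domains should contain bare domain names; Scrapy expects domains there, not URLs, so the trailing slash in the question's "website.com/" is invalid, which is why it's dropped in the code above. Second, if you also want a callback for the listing pages themselves, CrawlSpider provides the parse_start_url hook for exactly that purpose; don't route them through parse.

To quickly verify that the extractor pattern matches anything at all, you can test it in scrapy shell. A minimal sketch, assuming the URL and pattern from the question:

from scrapy.linkextractors import LinkExtractor

# Run `scrapy shell http://www.website.com/list_of_products.php` first;
# inside the shell, `response` is predefined. extract_links() returns
# the Link objects a Rule built from this extractor would follow.
le = LinkExtractor(allow='/product_page/.+')
print(le.extract_links(response))

If that prints an empty list, the allow pattern (or the allowed_domains value) is the culprit rather than the callback.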
Answered By - Ismael Padilla