Issue
I am trying to run my first CrawlSpider, but the program terminates without any errors and without returning anything - zero results. What's wrong with my code?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com']

    rules = (
        Rule(LinkExtractor(allow='product/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = Product()
            product['model'] = response.css('h2.data__symbol::text').get()
            product['brand'] = 'Fagor'
            product['file_urls'] = [doclink]
            yield product
Solution
The main problem is that this page uses JavaScript to add all elements to the HTML, but Scrapy can't run JavaScript. If you turn off JavaScript in your browser and reload the page, you should see an empty white page. There is, however, the module scrapy_selenium, which uses Selenium to control a real web browser that can run JavaScript (but it will run slower).
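You can confirm this yourself without a browser - a minimal sketch using only the standard library (the URL comes from the question; the User-Agent header is my assumption, added only to avoid a trivial block):

# Minimal check (a sketch): fetch the raw HTML without running JavaScript.
from urllib.request import Request, urlopen

req = Request('https://fagorelectrodomestico.com/en/',
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8', errors='replace')

# If this prints False, the product links are added later by JavaScript.
print('product/' in html)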
The other problem: your rule searches for links with product/, which I don't see on the main page but can see on the category pages. You don't have a rule to load those other pages, so the spider can't reach the product/ links on the subpages - it needs a second rule that follows the other links and sends them to the callback parse (which, in the spider, loads the page, extracts all links and checks the rules against them). You may also need to add /en/ to the start URL to get the English version, which has links with product/; the Spanish version has links with productos/ instead.
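To see which links each rule would actually pick up, you can run the extractors by hand, e.g. inside scrapy shell <category-url> (a sketch; response is the object the shell provides, and with a plain request the lists may well be empty because of the JavaScript problem described above):

from scrapy.linkextractors import LinkExtractor

# `response` is provided by `scrapy shell <url>`
product_links  = LinkExtractor(allow='/en/product/').extract_links(response)
category_links = LinkExtractor(allow='/en/', deny='/en/product/').extract_links(response)

print([link.url for link in product_links])   # what the first rule would follow
print([link.url for link in category_links])  # what the second rule would follow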
Some extra code is needed to use SeleniumRequest instead of the standard Request - I took some code from the source of CrawlSpider and added it with changes.
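For reference, the adapted method is CrawlSpider._requests_to_follow; in recent Scrapy versions it looks roughly like this (a simplified sketch, not the exact source - the real version also deduplicates links and applies process_links/process_request). It yields plain Requests, which is why the spider below re-implements the loop with SeleniumRequest:

# Simplified sketch of CrawlSpider._requests_to_follow (Scrapy 2.x)
def _requests_to_follow(self, response):
    for rule_index, rule in enumerate(self._rules):
        for link in rule.link_extractor.extract_links(response):
            yield Request(
                url=link.url,
                callback=rule.callback,
                errback=rule.errback,
                meta=dict(rule=rule_index, link_text=link.text),
            )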
I also used CrawlerProcess to run the code without creating a project, so everyone can simply copy it and run it with python script.py. It downloads the files to the folder full (the standard FilesPipeline creates it inside FILES_STORE). I tested it only without the option -headless, so I could see what the browser gets. You may want to test it with -headless: it may work faster, but sometimes it behaves differently.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import scrapy_selenium


class FagorelectrodomesticoSpider(CrawlSpider):
    name = 'fagorelectrodomestico.com'
    allowed_domains = ['fagorelectrodomestico.com']
    start_urls = ['https://fagorelectrodomestico.com/en/']

    rules = (
        Rule(LinkExtractor(allow='/en/product/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow='/en/', deny='/en/product/'), callback='parse', follow=True),
    )

    def start_requests(self):
        print('[start_requests]')
        for url in self.start_urls:
            print('[start_requests] url:', url)
            yield scrapy_selenium.SeleniumRequest(url=url, callback=self.parse)

    def parse(self, response):
        print('[parse] url:', response.url)
        for rule_index, rule in enumerate(self._rules):
            #print(rule.callback)
            for link in rule.link_extractor.extract_links(response):
                yield scrapy_selenium.SeleniumRequest(
                    url=link.url,
                    callback=rule.callback,
                    errback=rule.errback,
                    meta=dict(rule=rule_index, link_text=link.text),
                )

    def parse_item(self, response):
        print('[parse_item] url:', response.url)
        for doc in response.css('a.file'):
            doclink = doc.css('::attr("href")').get()
            product = {
                'model': response.css('h2.data__symbol::text').get(),
                'brand': 'Fagor',
                'file_urls': [doclink],
            }
            yield product


# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',

    # save in CSV, JSON or XML file
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in Scrapy 2.1

    # standard FilesPipeline (downloads to FILES_STORE/full)
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    #'FILES_STORE': '/path/to/valid/dir',  # this folder has to exist before downloading
    'FILES_STORE': '.',  # this folder has to exist before downloading

    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    #'SELENIUM_DRIVER_ARGUMENTS': ['-headless'],  # '--headless' if using chrome instead of firefox
    'SELENIUM_DRIVER_ARGUMENTS': [],
    #'SELENIUM_BROWSER_EXECUTABLE_PATH': '',
    #'SELENIUM_COMMAND_EXECUTOR': '',

    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},
})

c.crawl(FagorelectrodomesticoSpider)
c.start()
Answered By - furas