Monday, January 24, 2022

[FIXED] How to iterate to scrape each item no matter the position

Labels: scrapy, web-scraping

Issue

I'm using Scrapy and I'm trying to scrape the technical descriptions of products, but I can't find a tutorial that covers what I'm looking for.

I'm using this website: Air Conditioner 1

For example, I need to extract the model of that product: Modelo ---> KCIN32HA3AN. It's in the 5th position: (//span[@class='gb-tech-spec-module-list-description'])[5]

But if I go to this other product: Air Conditioner 2

The model is: Modelo ---> ALS35-WCCR, and it's in the 6th position. So I only get 60 m3, since that is the value in the 5th position.

I don't know how to iterate to obtain each model regardless of its position.

This is the code I'm using right now:

from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

class Hotel(Item):
    titulo = Field()
    precio = Field()
    marca = Field()
    modelo = Field()

class TripAdvisor(CrawlSpider):
    name = 'Hoteles'

    custom_settings = {
      'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36',
      'CLOSESPIDER_PAGECOUNT': 20
    }

    start_urls = ['https://www.garbarino.com/productos/aires-acondicionados-split/4278']

    download_delay = 2

    rules = (
        Rule(  
            LinkExtractor(
                allow=r'/?page=\d+'
            ), follow=True),

        Rule( 
            LinkExtractor(
                allow=r'/aire-acondicionado-split'
            ), follow=True, callback='parse_items'),
    )

    def parse_items(self, response):
        sel = Selector(response)
        item = ItemLoader(Hotel(), sel)

        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('precio', '//*[@id="final-price"]/text()')
        # The next two XPaths select by position ([1] and [5]), so they return
        # the wrong value when a product lists its specs in a different order.
        item.add_xpath('marca', '(//span[@class="gb-tech-spec-module-list-description"])[1]/text()', MapCompose(lambda i: i.replace('\n', ' ').replace('\r', ' ').strip()))
        item.add_xpath('modelo', '(//span[@class="gb-tech-spec-module-list-description"])[5]/text()', MapCompose(lambda i: i.replace('\n', ' ').replace('\r', ' ').strip()))

        yield item.load_item()

Solution

Selecting elements by position is fragile: the website's layout can change often, and each change forces you to fix your crawler, in some cases several times.

Instead, you can anchor on something more closely tied to the element you want than its position.

For example, I accessed the site you linked and opened this product page. Note that the element holding the modelo value sits right next to the element that labels it as "Modelo":

<ul>
    <li>
        <h3 class="gb-tech-spec-module-list-title">Modelo</h3>
        <span class="gb-tech-spec-module-list-description">BSI26WCCR</span>
    </li>
    <li>
        <h3 class="gb-tech-spec-module-list-title">Tipo de Tecnología</h3>
        <span class="gb-tech-spec-module-list-description">Inverter</span>
    </li>
    ...
</ul>

So, you can do the following:

//*[contains(text(), "Modelo")]/following-sibling::*[contains(@class, "description")]/text()

That way, the XPath does not depend on the position.

Reference on how to use following-sibling.

Answered By - JPBeckner

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
