Issue
I'm using Scrapy and I'm trying to scrape technical descriptions from products, but I can't find any tutorial for what I'm looking for.
I'm using this website: Air Conditioner 1
For example, I need to extract the model of that product:
Modelo ---> KCIN32HA3AN
It's in the 5th position, so this XPath selects it:
(//span[@class='gb-tech-spec-module-list-description'])[5]
But if I go to this other product: Air Conditioner 2
The model is: Modelo ---> ALS35-WCCR
It's in the 6th position, so I only get 60 m3, since that is the value in the 5th position.
I don't know how to iterate to obtain each model regardless of its position.
This is the code I'm using right now:
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader


class Hotel(Item):
    titulo = Field()
    precio = Field()
    marca = Field()
    modelo = Field()


class TripAdvisor(CrawlSpider):
    name = 'Hoteles'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36',
        'CLOSESPIDER_PAGECOUNT': 20
    }
    start_urls = ['https://www.garbarino.com/productos/aires-acondicionados-split/4278']
    download_delay = 2
    rules = (
        Rule(
            LinkExtractor(
                allow=r'/?page=\d+'
            ), follow=True),
        Rule(
            LinkExtractor(
                allow=r'/aire-acondicionado-split'
            ), follow=True, callback='parse_items'),
    )

    def parse_items(self, response):
        sel = Selector(response)
        item = ItemLoader(Hotel(), sel)
        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('precio', '//*[@id="final-price"]/text()')
        item.add_xpath('marca', '(//span[@class="gb-tech-spec-module-list-description"])[1]/text()', MapCompose(lambda i: i.replace('\n', ' ').replace('\r', ' ').strip()))
        item.add_xpath('modelo', '(//span[@class="gb-tech-spec-module-list-description"])[5]/text()', MapCompose(lambda i: i.replace('\n', ' ').replace('\r', ' ').strip()))
        yield item.load_item()
Solution
Selecting elements by position is fragile: the website's markup can change often, and each change forces you to fix your crawler, in some cases several times.
Instead, you can anchor on something more closely tied to the element you want than its position.
For example, I accessed the site you linked and opened one of its product pages. Note that the element holding the value of modelo is a sibling of the element that "presents" the Modelo label:
<ul>
  <li>
    <h3 class="gb-tech-spec-module-list-title">Modelo</h3>
    <span class="gb-tech-spec-module-list-description">BSI26WCCR</span>
  </li>
  <li>
    <h3 class="gb-tech-spec-module-list-title">Tipo de Tecnología</h3>
    <span class="gb-tech-spec-module-list-description">Inverter</span>
  </li>
  ...
</ul>
So, you can do the following:
//*[contains(text(), "Modelo")]/following-sibling::*[contains(@class, "description")]/text()
That way, the XPath does not depend on the position.
- Reference: the XPath following-sibling axis.
Answered By - JPBeckner