Monday, January 24, 2022

[FIXED] How to scrape items from 2 different sections?

Issue

I'm new to Scrapy and web crawling, and I've been working on the page www.mercadolibre.com.mx. I have to get some data (descriptions and prices) about the products displayed on the start page. Here is my items.py:

from scrapy.item import Item, Field

class PruebaMercadolibreItem(Item):
    producto = Field()
    precio = Field()

And here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from prueba_mercadolibre.items import PruebaMercadolibreItem

class MLSpider(BaseSpider):
    name = "mlspider"
    allowed_domains = ["mercadolibre.com"]
    start_urls = ["http://www.mercadolibre.com.mx"]

    def parse (self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='item-data']")
        items = []
        for titles in titles:
            item = PruebaMercadolibreItem()
            item["producto"] = titles.select("p[@class='tit    le']/@title").extract()
            item["precio"] = titles.select("span[@class='ch-price']/text()").extract()
            items.append(item)
        return items

The problem is that I get the same results when I change this line:

    titles = hxs.select("//div[@class='item-data']")

To this:

    titles = hxs.select("//div[@class='item-data'] | //div[@class='item-data item-data-mp']")

And I'm not getting the same data as when I use the first line.

Can anyone help me? Do I have an error in my XPath selection?

Also, I can't find a good tutorial for using MySQL with Scrapy. I would appreciate any help. Thanks.

Solution

It is better to use contains() if you want to get all div tags whose class attribute contains item-data:

titles = hxs.select("//div[contains(@class, 'item-data')]")
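To see why an exact comparison misses the second section, here is a small stand-alone sketch. It uses only the standard library's ElementTree on made-up markup (the HTML and the titles A, B, C, D are assumptions, not the real page). ElementTree's XPath subset has no contains(), so the substring check is done in plain Python as a stand-in for contains(@class, 'item-data').

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the two sections in the question.
HTML = """
<html><body>
  <div class="item-data"><p class="title" title="A">A</p></div>
  <div class="item-data item-data-mp"><p class="title" title="B">B</p></div>
  <div class="item-database"><p class="title" title="D">D</p></div>
  <div class="other"><p class="title" title="C">C</p></div>
</body></html>
"""

root = ET.fromstring(HTML)


def titles(divs):
    """Collect the title attribute of the <p> child of each div."""
    return [d.find("p").get("title") for d in divs]


# Exact comparison: the attribute must equal the whole string "item-data".
exact = root.findall(".//div[@class='item-data']")

# Substring test, the Python counterpart of contains(@class, 'item-data').
substring = [d for d in root.iter("div") if "item-data" in d.get("class", "")]

# Token test: stricter, matches only a whole class name.
token = [d for d in root.iter("div") if "item-data" in d.get("class", "").split()]

print(titles(exact))      # ['A']
print(titles(substring))  # ['A', 'B', 'D']
print(titles(token))      # ['A', 'B']
```

Note that the substring test also picks up the unrelated item-database class. If that matters on the real page, the usual XPath idiom for a whole-token match is contains(concat(' ', normalize-space(@class), ' '), ' item-data ').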

Also, you have other problems in the spider:

in the loop, you are overwriting the titles variable (for titles in titles)
the class name in the producto XPath should be title, not tit le
you probably don't want lists in Field values; take the first item out of each extracted list
HtmlXPathSelector is deprecated; use Selector instead
select() is deprecated; use xpath() instead
BaseSpider has been renamed to Spider
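One caveat on taking the first item out of an extracted list: indexing with [0] raises IndexError when the XPath matches nothing. A minimal sketch of a guard, where first_or_default is a hypothetical helper name and not part of Scrapy:

```python
def first_or_default(values, default=None):
    """Return the first element of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Plain lists stand in for what extract() would return.
print(first_or_default(["$ 695"]))        # $ 695
print(first_or_default([]))               # None
print(first_or_default([], default=""))   # prints an empty line
```

Depending on your Scrapy version, selector lists may already offer extract_first() or get(), which return None instead of raising when there is no match.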

Here's the code with modifications:

from scrapy import Spider
from scrapy.selector import Selector
from prueba_mercadolibre.items import PruebaMercadolibreItem


class MLSpider(Spider):
    name = "mlspider"
    allowed_domains = ["mercadolibre.com"]
    start_urls = ["http://www.mercadolibre.com.mx"]

    def parse(self, response):
        hxs = Selector(response)
        # contains() also matches divs with extra classes, e.g. "item-data item-data-mp"
        titles = hxs.xpath("//div[contains(@class, 'item-data')]")
        for title in titles:
            item = PruebaMercadolibreItem()
            # extract() returns a list; [0] takes the first match
            item["producto"] = title.xpath("p[@class='title']/@title").extract()[0]
            item["precio"] = title.xpath("span[@class='ch-price']/text()").extract()[0]
            yield item

Example items from the output:

{'precio': u'$ 35,000', 'producto': u'Cuatrimoto, Utv De 500cc 4x4 ,moto , Motos, Atv ,'}
{'precio': u'$ 695', 'producto': u'Reloj Esp\xeda Camara Oculta Video Hd 16 Gb! Sony Compara.'}
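The question also asks about MySQL, which is not covered here. As a rough sketch only: Scrapy item pipelines are plain classes with open_spider, process_item and close_spider methods, so storage can live there. The class name, table name and columns below are assumptions, and the standard library's sqlite3 stands in for a MySQL driver so the example runs without a server. A MySQL DB-API driver would follow the same shape, typically with %s placeholders instead of ?.

```python
import sqlite3


class SqlStoragePipeline:
    """Hypothetical item pipeline; sqlite3 stands in for a MySQL driver."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS productos (producto TEXT, precio TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterised query; a MySQL driver would use %s instead of ?.
        self.conn.execute(
            "INSERT INTO productos (producto, precio) VALUES (?, ?)",
            (item["producto"], item["precio"]),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

A pipeline like this is switched on through the ITEM_PIPELINES setting in the project's settings.py, using the dotted path to wherever the class is defined.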

Answered By - alecxe

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
