Wednesday, June 15, 2022

[FIXED] How to get href from a class containing a specific text using CSS selector (Scrapy)


Issue

I am working with the following website: https://inmuebles.mercadolibre.com.mx/venta/, and I am trying to get the link from the "ver todos" button in the "Inmueble" section (in red). However, the "Tour virtual" and "Publicados hoy" sections (in blue) may or may not appear when visiting the site.

As shown in the image below, each ui-search-filter-dl element holds one section of the menu from the image above, while the ui-search-filter-container elements hold the sub-sections displayed by the site (e.g. Casas, Departamento & Terrenos for Inmueble). To obtain the link from the "ver todos" button in the "Inmueble" section, I was using this line of code:

ver_todos = response.css('div.ui-search-filter-dl')[2].css('a.ui-search-modal__link').attrib['href']

But since "Tour virtual" and "Publicados hoy" are not always on the page, I cannot be sure that the ui-search-filter-dl element at index 2 is always the one that holds the "ver todos" button.

I then tried to get the link from "ver todos" with this line of code:

response.css(''':contains("Inmueble") ~ .ui-search-filter-dt-title
                            .ui-search-modal__link::attr(href)''').extract()

Basically, I was trying to get the href from the section whose ui-search-filter-dt-title element contains the title "Inmueble". Unfortunately, the output is an empty list. I would like to find the "ver todos" link using CSS and regex, but I'm having trouble with it. How can I achieve that?

Solution

I think XPath makes it easier to select the target elements in most cases:

Code:

xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(@class,'ui-search-modal__link')]/@href"
url = response.xpath(xpath).extract()[0]

I didn't create a Scrapy project to check the code above. Instead, I tested the same XPath with requests and lxml:

from lxml import html
import requests

res = requests.get("https://inmuebles.mercadolibre.com.mx/venta/")

dom = html.fromstring(res.text)

xpath = "//div[contains(text(), 'Inmueble')]/following-sibling::ul//a[contains(@class,'ui-search-modal__link')]/@href"
url = dom.xpath(xpath)[0]

assert url == 'https://inmuebles.mercadolibre.com.mx/venta/_FiltersAvailableSidebar?filter=PROPERTY_TYPE'
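One caveat with indexing the result using [0]: it raises IndexError when nothing matches, for example if the page layout changes. Below is a minimal sketch of a guarded lookup. The helper name first_href and the inline HTML are hypothetical, and the markup is only inferred from the selectors in this post, so the live page may differ.

```python
from lxml import html

XPATH = (
    "//div[contains(text(), 'Inmueble')]/following-sibling::ul"
    "//a[contains(@class,'ui-search-modal__link')]/@href"
)

def first_href(page_html, xpath=XPATH):
    """Return the first XPath match, or None when nothing matches."""
    matches = html.fromstring(page_html).xpath(xpath)
    return matches[0] if matches else None

# Hypothetical markup, assumed from the class names used in the question.
SAMPLE = """<html><body>
<div class="ui-search-filter-dl">
  <div class="ui-search-filter-dt-title">Inmueble</div>
  <ul>
    <li class="ui-search-filter-container"><a href="#casas">Casas</a></li>
    <li><a class="ui-search-modal__link"
           href="https://inmuebles.mercadolibre.com.mx/venta/_FiltersAvailableSidebar?filter=PROPERTY_TYPE">Ver todos</a></li>
  </ul>
</div>
</body></html>"""

found = first_href(SAMPLE)
missing = first_href("<html><body><p>no filters here</p></body></html>")
```

In Scrapy, response.xpath(xpath).get() gives the same behaviour without a helper: it returns None instead of raising when there is no match.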

Since the same XPath expression should behave identically in Scrapy and lxml, I expect the Scrapy code shown at the beginning to work in your project as well.

Answered By - quasi-human

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
