Sunday, January 30, 2022

[FIXED] I want to extract the numbers at the end of a url using regular expressions in scrapy


Issue

I want to get '25430989' from the end of this url.

https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989

How would I write it using the xpath?

I get the link using this xpath: link = row.xpath('.//a/@href').get()

When I use a regex tester I can isolate it with r'(\d+)$', but when I put it into my code it doesn't work for some reason.

import scrapy
import re
from ..items import DonedealItem

class FarmtoolsSpider(scrapy.Spider):
    name = 'farmtools'
    allowed_domains = ['www.donedeal.ie']
    start_urls = ['https://www.donedeal.ie/all?source=private&sort=publishdate%20desc']

    def parse(self, response):
        items = DonedealItem()
        rows = response.xpath('//ul[@class="card-collection"]/li')

        for row in rows:
            if row.xpath('.//ul[@class="card__body-keyinfo"]/li[contains(text(),"0 min")]/text()'):

                link = row.xpath('.//a/@href').get() #this is the full link.
                linkid = link.re(r'(\d+)$').get()
                title = row.xpath('.//p[@class="card__body-title"]/text()').get()
                county = row.xpath('.//li[contains(text(),"min")]/following-sibling::node()/text()').get()
                price = row.xpath('.//p[@class="card__price"]/span[1]/text()').get()
                subcat = row.xpath('.//a/div/div[2]/div[1]/p[2]/text()[2]').get()

                items['link'] = link
                items['linkid'] = linkid
                items['title'] = title
                items['county'] = county
                items['price'] = price
                items['subcat'] = subcat

                yield items

I'm trying to get the linkid.

Solution

The problem is here:

            link = row.xpath('.//a/@href').get() #this is the full link.
            linkid = link.re(r'(\d+)$').get()

When you call .get(), it returns a string, which is stored in the link variable. Strings don't have a .re() method, so the next line fails. You can use one of the functions from the re module instead (see its docs for reference).
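To illustrate the failure outside Scrapy, here is a minimal sketch that uses the URL from the question as a stand-in for the string .get() returns. The variable name error_message is only for illustration.

```python
# The URL from the question, standing in for the string that .get() returns.
link = "https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989"

try:
    link.re(r"(\d+)$")  # plain strings have no .re() method
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The same AttributeError is what the spider raises on the linkid line.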

I would use re.findall(). It returns a list of the values that match the regex (in this case the list would hold only one item), or an empty list if nothing matches. re.search() is also a good choice, but it returns an re.Match object (or None if nothing matches), so you need to call .group(1) on the result to get the captured string.

import re  # Don't forget to import it

            ...
            link = row.xpath('.//a/@href').get()
            # linkid is a list such as ['25430989']; it is empty if the URL does not end in digits
            linkid = re.findall(r'(\d+)$', link)
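As a rough standalone sketch of both re options, the snippet below runs them against the example URL from the question. The names linkid_from_findall and linkid_from_search are illustrative, not part of the original spider.

```python
import re

# Example URL taken from the question.
link = "https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989"

# re.findall() returns a list of captured groups (an empty list if nothing matches).
matches = re.findall(r"(\d+)$", link)
linkid_from_findall = matches[0] if matches else None

# re.search() returns a Match object, or None if nothing matches.
match = re.search(r"(\d+)$", link)
linkid_from_search = match.group(1) if match else None

print(linkid_from_findall, linkid_from_search)
```

Both variables end up holding the string '25430989'. In the spider, .get() returns None when the XPath matches nothing, and passing None to re.findall() or re.search() raises a TypeError, so it may be worth checking link before applying the regex.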

Scrapy selectors also support regular expressions, so an alternative that doesn't need the re module is:

            linkid = row.xpath('.//a/@href').re_first(r'(\d+)$')

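Notice I didn't use .get() there: .re_first() is called directly on the selector list returned by .xpath() and returns the first match as a string, or None if nothing matches.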

Answered By - renatodvc

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
