Issue
I want to get '25430989' from the end of this url.
https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989
How would I write it using the xpath?
I get the link using this xpath: link = row.xpath('.//a/@href').get()
When I use a regex tester I can isolate it with r'(\d+)$', but when I put it into my code it doesn't work for some reason.
import re

import scrapy

from ..items import DonedealItem


class FarmtoolsSpider(scrapy.Spider):
    name = 'farmtools'
    allowed_domains = ['www.donedeal.ie']
    start_urls = ['https://www.donedeal.ie/all?source=private&sort=publishdate%20desc']

    def parse(self, response):
        items = DonedealItem()
        rows = response.xpath('//ul[@class="card-collection"]/li')
        for row in rows:
            if row.xpath('.//ul[@class="card__body-keyinfo"]/li[contains(text(),"0 min")]/text()'):
                link = row.xpath('.//a/@href').get()  # this is the full link.
                linkid = link.re(r'(\d+)$').get()
                title = row.xpath('.//p[@class="card__body-title"]/text()').get()
                county = row.xpath('.//li[contains(text(),"min")]/following-sibling::node()/text()').get()
                price = row.xpath('.//p[@class="card__price"]/span[1]/text()').get()
                subcat = row.xpath('.//a/div/div[2]/div[1]/p[2]/text()[2]').get()
                items['link'] = link
                items['linkid'] = linkid
                items['title'] = title
                items['county'] = county
                items['price'] = price
                items['subcat'] = subcat
                yield items
I'm trying to get the linkid.
Solution
The problem is here:
link = row.xpath('.//a/@href').get()  # this is the full link.
linkid = link.re(r'(\d+)$').get()
When you use the .get() method, it returns a string that is saved in the link variable, and strings don't have a .re() method for you to call. You can use one of the functions from the re module instead (docs for reference).
I would use re.findall(): it returns a list of the values that match the regex (in this case only one item), or an empty list if nothing matches. re.search() is also a good choice, but it returns an re.Match object (or None if nothing matches), so you then have to call .group(1) on it to get the captured digits.
import re  # Don't forget to import it
...
link = row.xpath('.//a/@href').get()
linkid = re.findall(r'(\d+)$', link)  # a list, e.g. ['25430989']
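To make the difference between the two functions concrete, here is a minimal standalone sketch using the example URL from the question. The variable names and the second, digit-less URL are only illustrative assumptions, not part of the original spider.

```python
import re

# Example URL taken from the question.
link = "https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989"

# re.findall() returns a list of captured groups; take the first one if present.
matches = re.findall(r"(\d+)$", link)
linkid_from_findall = matches[0] if matches else None

# re.search() returns a Match object, or None when nothing matches.
match = re.search(r"(\d+)$", link)
linkid_from_search = match.group(1) if match else None

# A link with no trailing digits (hypothetical) yields an empty list, not None.
no_digits = re.findall(r"(\d+)$", "https://www.example.com/cars-for-sale/")

print(linkid_from_findall, linkid_from_search, no_digits)
```

Either way you end up with the plain string '25430989', which is probably what you want to store in items['linkid'] rather than a one-element list.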
Scrapy selectors also support regex, so an alternative is to apply the pattern directly to the selector (no need for the re module):
linkid = row.xpath('.//a/@href').re_first(r'(\d+)$')
Notice I didn't use .get() there.
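If you want to see roughly what .re_first() gives you without running a spider, the sketch below is a plain-Python analogue. The helper name first_match is hypothetical and not part of Scrapy; it only approximates the behaviour (a single string, or a default when there is no match), and it also tolerates a missing href, where .get() would have returned None.

```python
import re
from typing import Optional


def first_match(pattern: str, text: Optional[str],
                default: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: return the first regex match in text, or default."""
    if text is None:
        # No href was found for this row, so there is nothing to search.
        return default
    match = re.search(pattern, text)
    if match is None:
        return default
    # Prefer the first capture group when the pattern defines one.
    return match.group(1) if match.groups() else match.group(0)


url = "https://www.example.com/cars-for-sale/2007-ford-focus-1-6-diesel/25430989"
print(first_match(r"(\d+)$", url))
print(first_match(r"(\d+)$", None))
```

In the spider itself, the one-line .re_first() call above is the simpler choice; this helper is only meant to show the shape of the result.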
Answered By - renatodvc