Issue
In a CrawlSpider, how can I scrape the marked field ("4 days ago" in the image) from the listing page before following each link? The CrawlSpider below works fine, but in parse_item I want to add a new field named 'Add posted' that holds the field marked in the image.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PropertySpider(CrawlSpider):
    name = 'property'
    allowed_domains = ['www.openrent.co.uk']
    start_urls = [
        'https://www.openrent.co.uk/properties-to-rent/london?term=London&skip=' + str(x)
        for x in range(0, 5, 20)
    ]
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@id='property-data']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath("//h1[@class='property-title']/text()").get(),
            'Price': response.xpath("//h3[@class='perMonthPrice price-title']/text()").get(),
            'Links': response.url,
            'Add posted': ?
        }
Solution
When you use the Rule object of Scrapy's CrawlSpider, the text of each extracted link is saved in the request's meta under the key link_text. You can read this value in the parse_item method and extract the time information with a regex. You can read more about it in the docs. See the example below.
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PropertySpider(CrawlSpider):
    name = 'property'
    allowed_domains = ['www.openrent.co.uk']
    # Note: range(0, 5, 20) yields only 0, so a single listing page is requested.
    start_urls = [
        'https://www.openrent.co.uk/properties-to-rent/london?term=London&skip=' + str(x)
        for x in range(0, 5, 20)
    ]
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@id='property-data']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # The Rule stores the text of the followed link in the request meta.
        link_text = response.request.meta.get("link_text") or ""
        m = re.search(r"(Last Updated.*ago)", link_text)
        # Default to None so the item is still yielded when nothing matches.
        posted = m.group(1).replace("\xa0", " ") if m else None
        yield {
            'Title': response.xpath("//h1[@class='property-title']/text()").get(),
            'Price': response.xpath("//h3[@class='perMonthPrice price-title']/text()").get(),
            'Links': response.url,
            'Add posted': posted,
        }
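To check the regex step without running a crawl, the extraction can be pulled out into a small helper and tried on a string. This is only a sketch: extract_posted is a hypothetical helper name, and the sample link text is made up for illustration, so the real text on the listing page may be laid out differently.

```python
import re

# Same pattern as in the spider: everything from "Last Updated" up to "ago".
POSTED_RE = re.compile(r"(Last Updated.*ago)")


def extract_posted(link_text):
    """Return the 'Last Updated ... ago' fragment of a link's text, or None.

    Hypothetical helper, not part of Scrapy.
    """
    if not link_text:
        # meta.get("link_text") can be None, e.g. for requests not built by a Rule.
        return None
    m = POSTED_RE.search(link_text)
    if m is None:
        return None
    # Replace non-breaking spaces with regular spaces.
    return m.group(1).replace("\xa0", " ")


# Made-up sample of what a listing link's text might look like (assumption).
sample = "2 Bed Flat, London\nLast Updated\xa0around\xa04 days ago\n£1,500 per month"
print(extract_posted(sample))
```

Inside parse_item this would be called as extract_posted(response.request.meta.get("link_text")), which keeps the None handling in one place.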
Answered By - msenior_