Issue
I would like to write a web crawler for a news site that finds all links and first checks, on each linked page, whether the article date is later than, for example, June 25, 2020. If it is, the crawler should extract all the desired data from that page.
I know how to extract everything; I just can't get the date check to work inside the function.
Can someone please help me?
I've written this so far... Everything works without the date part.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from datetime import datetime


class TsDateSpider(CrawlSpider):
    name = 'ts_date'
    allowed_domains = ['tagesschau.de']
    start_urls = ['https://tagesschau.de/']
    rules = (Rule(LinkExtractor(allow=('inland/'), deny=('magnifier'),), callback='parse_article', follow=True),)

    def parse_article(self, response):
        print('Got a response from %s.' % response.url)
        complete_article = response.xpath('//div[@class="storywrapper"]')
        for article in complete_article:
            start_date = datetime(2020, 6, 25)
            article_date = article.xpath('//meta[@name="date"]/@content')[0].get()
            article_dt = datetime.strptime(article_date, "%Y, %m, %d")
            print(article_dt)
            if start_date <= article_dt:
                yield request(callback=self.parse)

    def parse(self):
        title = article.xpath('//div[@class="meldungHead"]/h1/span[@class="dachzeile"]/text()').get()
        print(title)
Thank you in advance, Philipp
Solution
You need to make the following changes:
- Change the datetime format string so that it matches the value in the page's date meta tag.
- Yield the item directly from your parse_article method.
def parse_article(self, response):
    print('Got a response from %s.' % response.url)
    complete_article = response.xpath('//div[@class="storywrapper"]')
    for article in complete_article:
        start_date = datetime(2020, 6, 25)
        article_date = article.xpath('//meta[@name="date"]/@content')[0].get()
        # Changed: the format string now matches the timestamp in the meta tag
        article_dt = datetime.strptime(article_date, "%Y-%m-%dT%H:%M:%S")
        print(article_dt)
        if start_date <= article_dt:
            # Changed: yield your item here; a separate parse method is not
            # needed, because parse_article is already the callback
            title = article.xpath('//div[@class="meldungHead"]/h1/span[@class="dachzeile"]/text()').get()
            yield {'title': title}
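If you want to try the date check without running the spider, it can be pulled out into a small helper. This is only a sketch: the names is_on_or_after, START_DATE and DATE_FORMAT are made up for illustration, and it assumes the meta tag always holds a timestamp like 2020-06-26T10:15:00 with no timezone suffix. Anything that does not parse is simply treated as "skip this page".

```python
from datetime import datetime

# Hypothetical names, not part of Scrapy or the original spider.
START_DATE = datetime(2020, 6, 25)
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed format of the meta "date" content


def is_on_or_after(date_string, start=START_DATE, fmt=DATE_FORMAT):
    """Return True if date_string parses with fmt and is not earlier than start."""
    try:
        article_dt = datetime.strptime(date_string, fmt)
    except (TypeError, ValueError):
        # TypeError: date_string is None (no date found on the page).
        # ValueError: the string does not match the expected format.
        return False
    return start <= article_dt


if __name__ == "__main__":
    for sample in ("2020-06-26T10:15:00", "2020-06-24T23:59:59", "2020, 06, 26"):
        print(sample, is_on_or_after(sample))
```

Inside parse_article, the condition would then read `if is_on_or_after(article_date):`. The third sample uses the format string from the question and fails to parse, which is the mismatch that made strptime raise a ValueError in the original code.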
Answered By - Roman