Wednesday, January 12, 2022

[FIXED] How to crawl a site from a specific date on?

Tags: python, scrapy

Issue

I would like to write a web crawler for a news site that follows all links and first checks, on each linked page, whether the article date is on or after a given date, for example June 25, 2020. If it is, the crawler should extract all the desired data from that page.

I know how to extract everything; I just can't get the date check to work.

Can someone please help me?

I've written this so far... Everything works without the date part.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from datetime import datetime

class TsDateSpider(CrawlSpider):
    name = 'ts_date'
    allowed_domains = ['tagesschau.de']
    start_urls = ['https://tagesschau.de/']

    rules = (Rule(LinkExtractor(allow=('inland/'), deny=('magnifier'),), callback='parse_article', follow=True),)

    def parse_article(self, response):

        print('Got a response from %s.' % response.url)

        complete_article = response.xpath('//div[@class="storywrapper"]')
        for article in complete_article:

            start_date = datetime(2020, 6, 25)
            article_date = article.xpath('//meta[@name="date"]/@content')[0].get()
            article_dt = datetime.strptime(article_date, "%Y, %m, %d")

            print(article_dt)

            if start_date <= article_dt:
                yield request(callback=self.parse)


    def parse(self):
            title = article.xpath('//div[@class="meldungHead"]/h1/span[@class="dachzeile"]/text()').get()
            print(title)

Thank you in advance, Philipp

Solution

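You need to make the following changes:

- Change the datetime format string passed to strptime so that it matches the date value on the page ("%Y-%m-%dT%H:%M:%S" instead of "%Y, %m, %d").
- Yield the item directly from your parse_article method instead of yielding a request to a separate parse method.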

    def parse_article(self, response):

        print('Got a response from %s.' % response.url)

        complete_article = response.xpath('//div[@class="storywrapper"]')
        for article in complete_article:

            # Only articles dated on or after this day are extracted
            start_date = datetime(2020, 6, 25)
            article_date = article.xpath('//meta[@name="date"]/@content')[0].get()
            # Change 1: the format string has to match the meta tag's value
            article_dt = datetime.strptime(article_date, "%Y-%m-%dT%H:%M:%S")

            print(article_dt)

            if start_date <= article_dt:
                # Change 2: yield your item here (a separate parse method is not
                # required, because parse_article is already the rule's callback)
                title = article.xpath('//div[@class="meldungHead"]/h1/span[@class="dachzeile"]/text()').get()
                yield {'title': title}
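
The date check can also be tried on its own, outside the spider. The following is a minimal sketch, not part of the original answer: the helper names parse_article_date and is_on_or_after are hypothetical, the sample strings are made up rather than taken from the site, and it assumes the meta tag's value has exactly the form "%Y-%m-%dT%H:%M:%S". It returns None instead of raising when the value is missing or does not match, which is what happens with the original "%Y, %m, %d" format.

```python
from datetime import datetime
from typing import Optional

# Format string from the answer; assumed to match the <meta name="date"> value.
ARTICLE_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"
START_DATE = datetime(2020, 6, 25)


def parse_article_date(raw: Optional[str]) -> Optional[datetime]:
    """Parse the raw meta value; return None if it is missing or malformed."""
    if not raw:
        return None
    try:
        return datetime.strptime(raw.strip(), ARTICLE_DATE_FORMAT)
    except ValueError:
        # strptime raises ValueError when the string does not match the format
        return None


def is_on_or_after(raw: Optional[str], start: datetime = START_DATE) -> bool:
    """True only if the value parses and is not earlier than the start date."""
    article_dt = parse_article_date(raw)
    return article_dt is not None and start <= article_dt


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    for sample in ("2020-07-01T14:30:00", "2020-06-24T23:59:59", None):
        print(sample, is_on_or_after(sample))
```

Inside the spider, such a helper could be fed with response.xpath('//meta[@name="date"]/@content').get(), which returns None when nothing matches, so a page without the meta tag is skipped instead of raising an IndexError. If the real value carries a timezone offset or fractional seconds, the format string would need to be adjusted accordingly.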

Answered By - Roman

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
