Wednesday, August 24, 2022

[FIXED] Scrapy request does not trigger callback

August 24, 2022 python, scrapy, web-crawler, web-scraping No comments

Issue

The Scrapy request never triggers its callback: the '1' is never printed. I have researched this for a long time and still can't solve it. The callback does not fire for any URL I try.

ROBOTSTXT_OBEY = False is set in default_settings.py, and the request is created with dont_filter=True.

import scrapy as scrapy    
class TheSpider(scrapy.Spider):
    name = 'Test'
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.eventscribe.com',
        'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}

    def run(self):
        scrapy.Request(url='https://www.google.com/',
                              callback=self.parse, method='GET', headers=self.headers,
                              dont_filter=True)

    def parse(self, response, **kwargs):
        print('1')
        self.log("I just visited:" + response.url)
        scrapy.FormRequest.from_response(response, formdata={'startDate': '08.29.2021'},
                                         clickdata={'id': 'calendar-picker-submit'},
                                         method='POST',
                                         callback=self.new_response, headers=self.headers,
                                         dont_filter=True)

    def new_response(self, response):
        self.log("I just visited:" + response.url)
        response.xpath("//div[@class='row numbers-past-results']/div[@class='ball-number']/text()").extract()


theSpider = TheSpider(scrapy.Spider)
theSpider.run()

Can anyone help? Thanks in advance.

Solution

There are a few issues to resolve before this will work with Scrapy. I'm going to assume you intend to run the file as a script rather than through the scrapy CLI. Below are some of the problems with your code and possible solutions, but it would also be worth reading the quickstart section of the Scrapy docs: https://docs.scrapy.org/

You need to import CrawlerProcess (from scrapy.crawler) if you want a self-contained script with a standalone spider.
The entry point for the spider's crawl is the start_requests method, not run.
None of your methods yield the requests they create, so Scrapy never receives them and never schedules them.
Something in your headers is being rejected. Since I assume you are using those headers for a reason, I have not modified them; I simply don't send them with the first request.
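
To illustrate the point about yielding, here is a toy model in plain Python. It is not Scrapy's engine and uses none of Scrapy's API; ToyRequest, toy_engine and the example.com URL are hypothetical names made up for this sketch. It only shows why a request that is built and then dropped can never reach its callback, while a yielded one can.

```python
# Toy model only: this is NOT Scrapy's engine, just an illustration of
# why a request that is built but never yielded cannot reach a callback.
from collections import deque


class ToyRequest:
    """Hypothetical stand-in for a request: a URL plus a callback."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def toy_engine(start_requests):
    """Run callbacks only for requests the spider actually hands back."""
    queue = deque(start_requests() or [])
    visited = []
    while queue:
        request = queue.popleft()
        visited.append(request.url)
        # A callback may return None or an iterable of follow-up requests.
        queue.extend(request.callback(request.url) or [])
    return visited


def parse(url):
    # Terminal callback: no follow-up requests.
    return None


def builds_but_discards():
    # Mirrors the original run(): the object is created, then dropped.
    ToyRequest("https://example.com", parse)


def yields_request():
    # Mirrors the fixed start_requests(): the request is handed back.
    yield ToyRequest("https://example.com", parse)


discarded = toy_engine(builds_but_discards)
yielded = toy_engine(yields_request)
print(discarded)  # []
print(yielded)    # ['https://example.com']
```

The same reasoning applies to callbacks: in the toy model, follow-up requests only run if the callback returns or yields them.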

With those few changes, the 1 is printed to the screen when the parse callback is called.

import scrapy
from scrapy.crawler import CrawlerProcess


class TheSpider(scrapy.Spider):
    name = 'Test'
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.eventscribe.com',
        'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}

    def start_requests(self):
        yield scrapy.Request(url='https://www.google.com')

    def parse(self, response, **kwargs):
        print('1')
        yield scrapy.FormRequest.from_response(response, formdata={'startDate': '08.29.2021'},
                                         clickdata={'id': 'calendar-picker-submit'},
                                         method='POST',
                                         callback=self.new_response, headers=self.headers,
                                         dont_filter=True)

    def new_response(self, response):
        self.log("I just visited:" + response.url)
        response.xpath("//div[@class='row numbers-past-results']/div[@class='ball-number']/text()").extract()


process = CrawlerProcess()
process.crawl(TheSpider)
process.start()

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
