Issue
My Scrapy request does not trigger its callback: the '1' is never printed. I have researched this for a long time and still cannot solve it. The callback does not fire for any URL I try.
In default_settings.py, ROBOTSTXT_OBEY = False is specified. I also set dont_filter=True.
import scrapy as scrapy


class TheSpider(scrapy.Spider):
    name = 'Test'
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.eventscribe.com',
        'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}

    def run(self):
        scrapy.Request(url='https://www.google.com/',
                       callback=self.parse, method='GET', headers=self.headers,
                       dont_filter=True)

    def parse(self, response, **kwargs):
        print('1')
        self.log("I just visited:" + response.url)
        scrapy.FormRequest.from_response(response, formdata={'startDate': '08.29.2021'},
                                         clickdata={'id': 'calendar-picker-submit'},
                                         method='POST',
                                         callback=self.new_response, headers=self.headers,
                                         dont_filter=True)

    def new_response(self, response):
        self.log("I just visited:" + response.url)
        response.xpath("//div[@class='row numbers-past-results']/div[@class='ball-number']/text()").extract()


theSpider = TheSpider(scrapy.Spider)
theSpider.run()
Can anyone help? Thanks in advance.
Solution
There are a few issues that need to be resolved before this will work with Scrapy. I'm going to assume you intend to run the file as a script rather than through the scrapy CLI. Below are the problems I found in your code and possible solutions, but you should also read the quickstart section of the Scrapy docs: https://docs.scrapy.org/
- You need to import CrawlerProcess if you want a self-contained script with a standalone spider.
- The entry point for the spider's crawl is the start_requests method, not run.
- None of your methods yield their requests. Scrapy only schedules requests that are yielded (or returned) from start_requests or from a callback.
- Something about your headers is causing the request to be rejected. Since I assume you are using those headers for a reason, I have not modified them; I simply leave them off the initial request.
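The point about yielding can be illustrated without Scrapy at all. The following is a toy model in plain Python, not Scrapy's actual engine: ToyRequest and toy_engine are hypothetical names made up for this sketch. It only shows that a request object which is built but never handed back to the caller cannot be scheduled.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request object; this is not scrapy.Request."""
    url: str
    callback: Optional[Callable[[str], Optional[Iterable["ToyRequest"]]]] = None


def toy_engine(start: Optional[Iterable[ToyRequest]]) -> list:
    """Toy scheduler: only requests handed back to it are ever processed."""
    visited = []
    queue = list(start or [])  # None means nothing was handed back
    while queue:
        request = queue.pop(0)
        visited.append(request.url)
        if request.callback is not None:
            # A callback may return None or an iterable of follow-up requests.
            queue.extend(request.callback(request.url) or [])
    return visited


def parse(url: str) -> Iterator[ToyRequest]:
    # Yielding hands the follow-up request back to the engine.
    yield ToyRequest(url + "/next")


def start_without_yield() -> None:
    # The request is built and then discarded; the function returns None.
    ToyRequest("https://example.com", callback=parse)


def start_with_yield() -> Iterator[ToyRequest]:
    yield ToyRequest("https://example.com", callback=parse)


print(toy_engine(start_without_yield()))  # nothing is visited
print(toy_engine(start_with_yield()))     # both URLs are visited
```

In the question's code, run() and parse() behave like start_without_yield(): each builds a request and drops it, so no callback can ever be called.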
With those few changes you can now see the 1 printed to the screen when the parse callback is called.
import scrapy as scrapy
from scrapy.crawler import CrawlerProcess


class TheSpider(scrapy.Spider):
    name = 'Test'
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.eventscribe.com',
        'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}

    # Scrapy calls start_requests() to obtain the first requests of the crawl.
    def start_requests(self):
        # The custom headers are left off this request; parse is the default callback.
        yield scrapy.Request(url='https://www.google.com')

    def parse(self, response, **kwargs):
        print('1')
        # Yield the follow-up request so that Scrapy schedules it.
        yield scrapy.FormRequest.from_response(response, formdata={'startDate': '08.29.2021'},
                                               clickdata={'id': 'calendar-picker-submit'},
                                               method='POST',
                                               callback=self.new_response, headers=self.headers,
                                               dont_filter=True)

    def new_response(self, response):
        self.log("I just visited:" + response.url)
        response.xpath("//div[@class='row numbers-past-results']/div[@class='ball-number']/text()").extract()


# Run the spider from a plain Python script instead of the scrapy CLI.
process = CrawlerProcess()
process.crawl(TheSpider)
process.start()
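I did not track down which header is being rejected. One plausible guess, and it is only an assumption, is the hard-coded Host header: it names www.eventscribe.com while the request goes to www.google.com. If you need the other headers, a small helper along these lines could drop a Host entry that does not match the target URL. headers_for is a hypothetical name for this sketch and is not part of Scrapy.

```python
from urllib.parse import urlsplit


def headers_for(url: str, headers: dict) -> dict:
    """Return a copy of headers without a Host entry that names another site.

    Hypothetical helper for illustration. It compares plain host names only
    and does not handle a Host value that includes a port.
    """
    target_host = urlsplit(url).hostname  # lower-cased by urlsplit
    cleaned = {}
    for name, value in headers.items():
        if name.lower() == "host" and value.lower() != target_host:
            continue  # skip a Host header that belongs to a different site
        cleaned[name] = value
    return cleaned


headers = {
    "Accept": "*/*",
    "Host": "www.eventscribe.com",
    "User-Agent": "Mozilla/5.0",
}

print(headers_for("https://www.google.com/", headers))
print(headers_for("https://www.eventscribe.com/2018/ADEA/", headers))
```

In the spider, the result could then be passed as headers=headers_for(url, self.headers) when building the scrapy.Request. If the callback still does not fire, check the crawl log for the response status of that request.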
Answered By - Alexander