Issue
My Scrapy code doesn't work. I'm trying to scrape a forum, but I get an error. Here is my code:
import scrapy, time

class ForumSpiderSpider(scrapy.Spider):
    name = 'forum_spider'
    allowed_domains = ['visforvoltage.org/latest_tech/']
    start_urls = ['http://visforvoltage.org/latest_tech//']

    def parse_urls(self, response):
        for href in response.css(r"tbody a[href*='/forum/']::attr(href)").extract():
            url = response.urljoin(href)
            print(url)
            req = scrapy.Request(url, callback=self.parse_data)
            time.sleep(10)
            yield req

    def parse_data(self, response):
        for sel in response.css('html').extract():
            data = {}
            data['name'] = response.css(r"div[class='author-pane-line author-name'] span[class='username']::text").extract()
            data['date'] = response.css(r"div[class='forum-posted-on']:contains('-') ::text").extract()
            data['title'] = response.css(r"div[class='section'] h1[class='title']::text").extract()
            data['body'] = response.css(r"div[class='field-items'] p::text").extract()
            yield data
        next_page = response.css(r"li[class='pager-next'] a[href*='page=']::attr(href)").extract()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse_urls)
Here is the error:
[scrapy.core.scraper] ERROR: Spider error processing <GET https://visforvoltage.org/latest_tech> (referer: None)
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: ForumSpiderSpider.parse callback is not defined
I would really appreciate it if somebody could help me with this!
Solution
The parent class scrapy.Spider has a method called start_requests. That is the method that reads your start_urls and creates the first requests for the spider.
Those requests expect your spider to have a method called parse to work as the callback function. So the quickest way to solve the problem is to rename your parse_urls method to parse, like this:
def parse(self, response):
    for href in response.css(r"tbody a[href*='/forum/']::attr(href)").extract():
        url = response.urljoin(href)
        print(url)
        req = scrapy.Request(url, callback=self.parse_data)
        time.sleep(10)
        yield req
If you want to change that behavior, you need to override the start_requests method in your class, so that you can choose the name of the callback function yourself. For example:
def start_requests(self):
    for url in self.start_urls:
        # The question's code imports scrapy, not Request, so use scrapy.Request
        yield scrapy.Request(url, callback=self.parse_urls, dont_filter=True)
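
To see why both fixes work, here is a rough sketch of the callback fallback in plain Python. It is a simplified model, not Scrapy's actual source: the Request and Spider classes below are toy stand-ins, and crawl, BrokenSpider, RenamedSpider, OverridingSpider and the example.com URL are hypothetical names made up for illustration.

```python
class Request:
    """Toy stand-in for scrapy.Request: a URL plus an optional callback."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class Spider:
    """Toy stand-in for scrapy.Spider (an assumption, not the real class)."""
    start_urls = []

    def start_requests(self):
        # The initial requests are created without an explicit callback.
        for url in self.start_urls:
            yield Request(url)

    def parse(self, response):
        # Default callback; subclasses are expected to override it.
        raise NotImplementedError(
            '{}.parse callback is not defined'.format(self.__class__.__name__))


def crawl(spider):
    """Hypothetical driver: call each request's callback, else spider.parse."""
    results = []
    for request in spider.start_requests():
        callback = request.callback or spider.parse
        results.append(callback('response for ' + request.url))
    return results


class BrokenSpider(Spider):
    start_urls = ['http://example.com/']

    def parse_urls(self, response):  # never called: no request points at it
        return 'parse_urls got ' + response


class RenamedSpider(Spider):
    start_urls = ['http://example.com/']

    def parse(self, response):  # fix 1: use the default callback name
        return 'parse got ' + response


class OverridingSpider(BrokenSpider):
    def start_requests(self):  # fix 2: name the callback explicitly
        for url in self.start_urls:
            yield Request(url, callback=self.parse_urls)
```

In this model, crawl(BrokenSpider()) raises the same NotImplementedError message as in the question, while the other two spiders reach their callbacks. Real Scrapy callbacks are usually generators that yield items and requests; the toy ones return strings only to keep the sketch short.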
Answered By - renatodvc