Issue
I'm trying to scrape a subreddit using Scrapy; however, I keep getting a 404 error every time I run the spider.
2020-01-07 12:21:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.reddit.com/r/gameofthrones//>: HTTP status code is not handled or not allowed
The code I am currently using:
import scrapy


class RedditbotSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['www.reddit.com/r/gameofthrones/']
    start_urls = ['http://www.reddit.com/r/gameofthrones//']

    def parse(self, response):
        # Extracting the content using css selectors
        titles = response.css('.title.may-blank::text').extract()
        votes = response.css('.score.unvoted::text').extract()
        times = response.css('time::attr(title)').extract()
        comments = response.css('.comments::text').extract()

        # Give the extracted content row wise
        for item in zip(titles, votes, times, comments):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'vote': item[1],
                'created_at': item[2],
                'comments': item[3],
            }

            # yield or give the scraped info to scrapy
            yield scraped_info
I have tried rerunning after changing the USER_AGENT in the settings.py file; however, I still have the same issue.
Solution
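Check your URL. http://www.reddit.com/r/gameofthrones// (note the double slash at the end), which you wrote as your start URL, does not exist and returns a 404 error. Use http://www.reddit.com/r/gameofthrones/ with a single trailing slash instead.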
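As a rough illustration of the fix above, here is a minimal sketch that collapses repeated slashes in a URL's path before the URL is used as a start URL. The helper name collapse_slashes is hypothetical and not part of Scrapy; it relies only on the Python standard library, and it assumes repeated slashes in the path are never intentional for the site being scraped.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the URL path, leaving scheme and host untouched.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    parts = urlsplit(url)
    # Only the path is rewritten, so the '//' after the scheme is preserved.
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, clean_path, parts.query, parts.fragment)
    )


# The start URL from the question, with the accidental double slash removed.
start_url = collapse_slashes("http://www.reddit.com/r/gameofthrones//")
print(start_url)
```

The resulting value could then be placed in the spider's start_urls list. In most cases simply typing the URL correctly is enough; a helper like this is only worth it when URLs are built by string concatenation.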
Answered By - Bastian