Issue
I have a simple Scrapy spider:
import scrapy
from scrapy.crawler import CrawlerProcess


class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = [
            'https://api.ipify.org?format=json',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.logger.info('================Request: %s, IP address: %s' % (response.request, response.text))


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ScraperSpider)
    process.start()
However, the server responds with HTTP 400 and Scrapy ignores the response:
2023-12-18 23:56:34 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://api.ipify.org?format=json> (referer: None)
2023-12-18 23:56:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://api.ipify.org?format=json>: HTTP status code is not handled or not allowed
2023-12-18 23:56:34 [scrapy.core.engine] INFO: Closing spider (finished)
Yet the same URL can be fetched without problems using curl or a browser.
Solution
Add a / before the ? in the URL, so that the request has an explicit path:
import scrapy


class ScraperSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        urls = [
            'https://api.ipify.org/?format=json',
        ]
        for url in urls:
            yield scrapy.Request(url=url)

    def parse(self, response):
        self.logger.info('================Request: %s, IP address: %s' % (response.request, response.json().get('ip')))
Output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.ipify.org/?format=json> (referer: None)
[scraper] INFO: ================Request: <GET https://api.ipify.org/?format=json>, IP address: X.X.X.X
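
If the URLs come from another source and cannot be edited by hand, the same fix can be applied in code. The following is a minimal sketch using only the standard library; ensure_root_path is a hypothetical helper name chosen for illustration, not part of Scrapy.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_root_path(url: str) -> str:
    """Return the URL with "/" as its path when the path is empty.

    Hypothetical helper (not a Scrapy API): it only touches URLs that
    have a host but no path, and leaves every other URL unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc and not parts.path:
        # e.g. https://host?x=1 becomes https://host/?x=1
        parts = parts._replace(path="/")
    return urlunsplit(parts)


if __name__ == "__main__":
    print(ensure_root_path("https://api.ipify.org?format=json"))
```

In the spider, this would be used as scrapy.Request(url=ensure_root_path(url)) inside start_requests.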
Answered By - SuperUser