Issue
I am using scrapy to crawl this page, but for some reason scrapy cannot get a successful response from this website. When I run the crawler I receive an HTTP 500 error.
Here is my basic spider:
import scrapy


class SavingsGov(scrapy.Spider):
    name = 'savings'
    start_urls = [
        'https://savings.gov.pk/download-draws/'
    ]

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }
Here are the errors I get when I run it (I have also increased the number of retries to 10 in settings.py):
2023-08-26 16:30:22 [scrapy.core.engine] INFO: Spider opened
2023-08-26 16:30:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-26 16:30:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-26 16:30:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/robots.txt> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/robots.txt> (referer: None)
2023-08-26 16:30:40 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-08-26 16:30:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:43 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:53 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/download-draws/> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/download-draws/> (referer: None)
2023-08-26 16:30:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://savings.gov.pk/download-draws/>: HTTP status code is not handled or not allowed
2023-08-26 16:30:56 [scrapy.core.engine] INFO: Closing spider (finished)
However, I can easily get a response using Python's requests module.
Here is the Python code for that:
import requests
response = requests.get('https://savings.gov.pk/download-draws/')
print(response.text)
I don't know why this is happening. I am assuming that the problem is with scrapy.Request.
Is there any way to perform the request with requests and pass the response to scrapy? The preferable option, though, would be to somehow debug scrapy.Request.
I am new to scrapy, so if there is a possibility that I'm misunderstanding the problem, please let me know.
Solution
The server most likely rejects requests that carry Scrapy's default user agent.
Try setting a custom one in the spider's custom_settings. Also set ROBOTSTXT_OBEY to False.
For example:
import scrapy


class SavingsGov(scrapy.Spider):
    name = 'savings'
    start_urls = [
        'https://savings.gov.pk/download-draws/'
    ]
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }
Partial output:
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draw-list/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-200-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-15000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-7500-draws/'}
Answered By - Alexander