Issue
I'm new to Scrapy and I made a Scrapy project to scrape data.
I'm trying to scrape data from the website, but I'm getting the following error logs:
2016-08-29 14:07:57 [scrapy] INFO: Enabled item pipelines:
[]
2016-08-29 13:55:03 [scrapy] INFO: Spider opened
2016-08-29 13:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/robots.txt> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/Mumbai/small-business> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Ignoring response <403 http://www.justdial.com/Mumbai/small-business>: HTTP status code is not handled or not allowed
2016-08-29 13:55:04 [scrapy] INFO: Closing spider (finished)
When I run the following commands in the website's browser console I get a response, but when I use the same XPath inside my Python script I get the error described above.
Commands on the web console:
$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()')
$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/text()')
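For reference, the same selectors inside a Scrapy callback would look roughly like this (a sketch, not the exact script; the extract() calls are assumed):

def parse(self, response):
    # Same XPath as the console commands above
    names = response.xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()').extract()
    contacts = response.xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/text()').extract()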
Please help me.
Thanks
Solution
As Avihoo Mamka mentioned in the comments, you need to provide some extra request headers to avoid being rejected by this website.
In this case it seems to be just the User-Agent header. By default Scrapy identifies itself with the user agent "Scrapy/{version} (+http://scrapy.org)", and some websites might reject this for one reason or another. To avoid this, set the headers parameter of your Request to a common user-agent string:
from scrapy import Request

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
yield Request(url, headers=headers)
You can find huge lists of user-agent strings online, though you should stick with popular web-browser ones such as Firefox or Chrome for the best results.
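If you want every request in the project to send the same identity, you can also use Scrapy's built-in USER_AGENT setting instead of passing headers per request; a minimal sketch in settings.py, reusing the Firefox string from above:

# settings.py -- applies to all requests made by the project
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'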
You can implement this to work with your spider's start_urls too:
import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'http://scrapy.org',
    )

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)
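Applied to the site from the question, a fuller sketch might look like this (the XPath comes from the console commands above; the spider name and item keys are illustrative, and the class attribute must still match the live page):

import scrapy
from scrapy import Request

class JustdialSpider(scrapy.Spider):
    name = "justdial"
    start_urls = ('http://www.justdial.com/Mumbai/small-business',)

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        # One dict per store listing; selectors taken from the question
        for store in response.xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]'):
            yield {
                'name': store.xpath('./h4/span/a/text()').extract_first(),
                'contact': store.xpath('./p[@class="contact-info"]/span/a/text()').extract_first(),
            }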
Answered By - Granitosaurus