Issue
I am trying to scrape URLs of the form:
https://in.bookmyshow.com/XXXXX/cinemas, where 'XXXXX' is any city name. I have around 880 city names in a file, and I want to scrape data from each URL.
My sample code is here: https://www.jdoodle.com/a/u1E
The file from which the city names are read is here: https://www.jdoodle.com/a/u1G
The problem I am facing is that when I run Scrapy with its default settings, it runs requests asynchronously and concurrently, but in doing so it misses about half of the URLs to be scraped.
If I instead run Scrapy with the settings in Option 2 below, it scrapes all the URLs, but it takes an unreasonable amount of time to complete.
Is there a way to keep running my script concurrently without losing any of the data to be scraped?
Option 1:
Settings: default
Stats:
'downloader/request_count': 1331
'item_scraped_count': 444
Time to complete: 9 min
Option 2:
Settings: {'AUTOTHROTTLE_ENABLED': True, 'CONCURRENT_REQUESTS': 1, 'DOWNLOAD_DELAY': 3}
Stats:
'downloader/request_count': 1772
'item_scraped_count': 878
Time to complete: 1 hr 45 min
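For reference, the Option 2 settings above can live in the project's settings.py (or a spider's custom_settings). A middle ground that keeps some concurrency while still throttling might look like the sketch below; the values are illustrative guesses, not tuned numbers, and all setting names are standard Scrapy settings:

```python
# settings.py -- illustrative middle ground between Option 1 (default)
# and Option 2 (single request with a 3 s delay)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0  # AutoThrottle aims for ~4 parallel requests
CONCURRENT_REQUESTS = 8                # hard upper bound on concurrency
DOWNLOAD_DELAY = 1                     # base delay; AutoThrottle adjusts around it
```

Whether this avoids the dropped URLs depends on why the site rejects concurrent requests, which the answer below addresses.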
Solution
It seems your issue is not delay but concurrency: with one concurrent request it works, with more it doesn't. The most likely reason is that the website serves you content based on your cookies.
Try disabling cookies via the COOKIES_ENABLED setting in your settings.py file:
COOKIES_ENABLED = False
If you see that content is not served at all without cookies, you instead need to use cookiejars to keep multiple cookie sessions working in parallel.
Answered By - Granitosaurus