Issue
I have a list of 2000 start URLs and I'm using:
DOWNLOAD_DELAY = 0.25
to control the speed of the requests, but I also want to add a bigger delay after every n requests. For example, I want a delay of 0.25 seconds for each request and a delay of 100 seconds every 500 requests.
Edit:
Sample code:
import os
from os.path import join
import scrapy
import time

date = time.strftime("%d/%m/%Y").replace('/', '_')
# Maps each start URL to the file name its page is saved under.
list_of_pages = {'http://www.lapatilla.com/site/': 'la_patilla',
                 'http://runrun.es/': 'runrunes',
                 'http://www.noticierodigital.com/': 'noticiero_digital',
                 'http://www.eluniversal.com/': 'el_universal',
                 'http://www.el-nacional.com/': 'el_nacional',
                 'http://globovision.com/': 'globovision',
                 'http://www.talcualdigital.com/': 'talcualdigital',
                 'http://www.maduradas.com/': 'maduradas',
                 'http://laiguana.tv/': 'laiguana',
                 'http://www.aporrea.org/': 'aporrea'}
root_dir = os.getcwd()
output_dir = join(root_dir, 'data/', date)


class TestSpider(scrapy.Spider):
    name = "news_spider"
    download_delay = 1
    start_urls = list(list_of_pages.keys())

    def parse(self, response):
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        # Note: response.url differs from the start URL if the site redirects.
        filename = list_of_pages[response.url]
        print(time.time())
        # Save the raw page body under today's output directory.
        with open(join(output_dir, filename), 'wb') as f:
            f.write(response.body)
The list in this case is shorter, but the idea is the same. I want to have two levels of delay: one for each request and one every 'N' requests. I'm not crawling the links, just saving the main page.
Solution
You can look into using the AutoThrottle extension. It does not give you tight control over the delays; instead it uses its own algorithm to slow the spider down, adjusting the delay on the fly depending on the response time and the number of concurrent requests.
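As a rough illustration, AutoThrottle is switched on through project settings. The setting names below are Scrapy's own; the numeric values are only assumptions for this example and would need tuning for a real crawl.

```python
# settings.py (sketch; the numeric values are illustrative assumptions)
DOWNLOAD_DELAY = 0.25                  # AutoThrottle never goes below this delay
AUTOTHROTTLE_ENABLED = True            # turn the extension on
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound used when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = True              # log throttling stats for each response
```

Keep in mind that this adapts the delay to server latency; it does not produce a fixed 100 second pause every 500 requests.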
If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to AutoThrottle; see its source).
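A minimal sketch of the custom middleware route could look like the following. The class name PeriodicPauseMiddleware and the PAUSE_EVERY / PAUSE_SECONDS setting names are hypothetical, made up for this example. It counts requests as they reach the downloader and sleeps when every Nth one arrives. The sleep is a plain blocking call, which stalls the whole crawl (including in-flight downloads), so treat it as a crude starting point rather than a polished solution.

```python
import time


class PeriodicPauseMiddleware:
    """Downloader middleware that takes a long pause at every Nth request.

    Hypothetical sketch: the class name and the PAUSE_EVERY / PAUSE_SECONDS
    setting names are assumptions, not part of Scrapy.
    """

    def __init__(self, every=500, pause=100.0, sleep=time.sleep):
        self.every = every    # pause once per this many requests
        self.pause = pause    # length of the long pause, in seconds
        self.sleep = sleep    # injectable, so the logic can be tested quickly
        self.count = 0        # requests seen so far

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware from project settings.
        return cls(
            every=crawler.settings.getint("PAUSE_EVERY", 500),
            pause=crawler.settings.getfloat("PAUSE_SECONDS", 100.0),
        )

    def process_request(self, request, spider):
        self.count += 1
        if self.count % self.every == 0:
            # Blocking sleep: this freezes Scrapy's event loop, so nothing
            # else is downloaded or processed until it returns.
            self.sleep(self.pause)
        return None  # let the request continue through the normal chain
```

To try it, you would register the class under DOWNLOADER_MIDDLEWARES in your settings (the module path depends on your project layout) and keep DOWNLOAD_DELAY = 0.25 for the per-request delay. Retries and redirects also pass through process_request, so they count towards N as well.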
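You can also change the .download_delay attribute of your spider on the fly. By the way, this is essentially what the AutoThrottle extension does under the hood: it updates the delay value on the fly (in recent Scrapy versions it adjusts the delay on the downloader slot rather than the spider attribute itself).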
Answered By - alecxe