Issue
I have several Scrapy spiders in my project's spiders directory (let's say 50 spiders), and I want to run them sequentially (not concurrently).
I could run them concurrently with the following code, but because of a policy decision I need to run them sequentially:
from datetime import datetime

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

start = datetime.now()
settings = get_project_settings()
process = CrawlerProcess(settings)
for spider_name in process.spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name)  # queue the spider; custom arguments could also be passed here
process.start()  # starts all queued crawls concurrently
print("***********Execution time : {0}".format(datetime.now() - start))
Also, I tried
import os

for spider_name in process.spider_loader.list():
    print("Running spider %s" % spider_name)
    os.system("pwd")  # pwd just confirms the script is in the correct path, and it is
    os.system("pwd && scrapy crawl " + spider_name)
but the crawls don't seem to run at all via os.system. Another option would be a .sh script, but I'm not sure that's a good idea.
How can I run the spiders sequentially? (For reference, a subprocess-based variant of the shell approach is sketched below.)
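A minimal sketch of that idea using the subprocess module instead of os.system, assuming the scrapy CLI is on PATH and the script is started from the project directory. Since subprocess.run waits for each command to finish, the spiders run one at a time:

import subprocess

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

spider_loader = SpiderLoader.from_settings(get_project_settings())
for spider_name in spider_loader.list():
    print("Running spider %s" % spider_name)
    # check=True stops the loop as soon as one crawl exits with an error
    subprocess.run(["scrapy", "crawl", spider_name], check=True)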
Solution
The Scrapy documentation has a section explaining how to run multiple spiders in the same process and also how to do this sequentially.
In your case, it could look like this:
from datetime import datetime

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

start = datetime.now()
settings = get_project_settings()
configure_logging(settings)
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    for spider_name in runner.spider_loader.list():
        print("Running spider %s" % spider_name)
        # yielding the crawl's Deferred makes each spider finish
        # before the next one starts
        yield runner.crawl(spider_name)
    reactor.stop()

crawl()
reactor.run()  # blocks until the last crawl calls reactor.stop()
print("***********Execution time : {0}".format(datetime.now() - start))
This only works if the spider loader is properly configured (see the SPIDER_MODULES setting). Otherwise, you can also simply list all your spiders in the crawl method one by one, as shown below.
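For example, with hypothetical spider classes standing in for your real ones (runner.crawl accepts a spider class as well as a spider name), the loop could be replaced by explicit calls:

# FirstSpider and SecondSpider are hypothetical placeholders for your own spiders
from myproject.spiders.first import FirstSpider
from myproject.spiders.second import SecondSpider

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(FirstSpider)
    yield runner.crawl(SecondSpider)
    reactor.stop()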
Answered By - Oliver Sauder