Issue
I have the following task: the DB contains ~2k URLs, and for each URL we need to run a spider until all of them are processed. I was running the spider for a batch of URLs (10 per run).
I have used the following code:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()

for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)
    limit = 10
    kount = 0
    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
            )
    process = CrawlerProcess(settings)
    process.start()
But it only runs for the first iteration of the loop; on the second iteration I get this error:
File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
reactor.run(installSignalHandlers=False) # blocking call
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
ReactorBase.startRunning(cast(ReactorBase, self))
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Is there any way to avoid this error and run the spider for all 2k URLs?
Solution
This happens because the Twisted reactor cannot be started twice in the same process. You can use multiprocessing and launch each batch in a separate process. Your code might look like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import multiprocessing as mp

# MySpider is the spider class from your Scrapy project.

def start_crawlers(urls_batch, limit=10):
    # Each call runs in its own process, so it gets a fresh Twisted reactor.
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    kount = 0
    for url in urls_batch:
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[url]
            )
    process.start()  # blocks until all spiders in this batch finish

if __name__ == "__main__":
    URLs = ...  # list of URL batches, e.g. chunks of 10 URLs each
    for urls_batch in URLs:
        process = mp.Process(target=start_crawlers, args=(urls_batch,))
        process.start()
        process.join()  # wait for this batch before launching the next
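For completeness, here is one way the batches could be built; this is a minimal sketch that assumes the crawler_table query from the question is available and that each row has a crawl_url field, and it reuses the start_crawlers function defined above.

import multiprocessing as mp

def chunk(items, size=10):
    # Yield successive chunks of `size` items from a list.
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    # Assumption: crawler_table.find(...) is the question's DB query and
    # returns rows that contain a 'crawl_url' field.
    rows = list(crawler_table.find(crawl_timestamp=None))
    urls = [row['crawl_url'] for row in rows]

    for urls_batch in chunk(urls, size=10):
        p = mp.Process(target=start_crawlers, args=(urls_batch,))
        p.start()
        p.join()

Because join() is called right after start(), the batches run one after another, which keeps resource usage bounded; if you want batches to run in parallel, drop the join (or use a process pool) at the cost of more concurrent crawlers.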
Answered By - zaki98