Issue
I have the following task: the DB contains ~2k URLs, and for each URL we need to run a spider until all of them are processed. I was running the spider for a batch of URLs (10 per run).
I have used the following code:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

URLs = crawler_table.find(crawl_timestamp=None)
settings = get_project_settings()

for i in range(len(URLs) // 10):
    process = CrawlerProcess(settings)
    limit = 10
    kount = 0
    for crawl in crawler_table.find(crawl_timestamp=None):
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[crawl['crawl_url']]
            )
    process = CrawlerProcess(settings)
    process.start()
But it only runs for the first iteration of the loop; on the second iteration I get this error:
File "C:\Program Files\Python310\lib\site-packages\scrapy\crawler.py", line 327, in start
reactor.run(installSignalHandlers=False) # blocking call
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1314, in run
self.startRunning(installSignalHandlers=installSignalHandlers)
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 1296, in startRunning
ReactorBase.startRunning(cast(ReactorBase, self))
File "C:\Program Files\Python310\lib\site-packages\twisted\internet\base.py", line 840, in startRunning
raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Is there any way to avoid this error and run the spider for all 2k URLs?
Solution
This happens because the Twisted reactor cannot be started twice in the same process. You can use multiprocessing and launch each batch in a separate process. Your code might look like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import multiprocessing as mp

# MySpider is the spider class from your Scrapy project.

def start_crawlers(urls_batch, limit=10):
    # Each call runs in its own process, so it gets a fresh Twisted reactor.
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    kount = 0
    for url in urls_batch:
        if kount < limit:
            kount += 1
            process.crawl(
                MySpider,
                start_urls=[url]
            )
    process.start()  # blocks until all spiders in this batch finish

if __name__ == "__main__":
    URLs = ...  # list of URL batches, e.g. chunks of 10 URLs each
    for urls_batch in URLs:
        process = mp.Process(target=start_crawlers, args=(urls_batch,))
        process.start()
        process.join()  # wait for this batch before launching the next
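For completeness, here is one way the batches could be built; this is a minimal sketch that assumes the crawler_table query from the question is available and that each row has a crawl_url field, and it reuses the start_crawlers function defined above.

import multiprocessing as mp

def chunk(items, size=10):
    # Yield successive chunks of `size` items from a list.
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    # Assumption: crawler_table.find(...) is the question's DB query and
    # returns rows that contain a 'crawl_url' field.
    rows = list(crawler_table.find(crawl_timestamp=None))
    urls = [row['crawl_url'] for row in rows]

    for urls_batch in chunk(urls, size=10):
        p = mp.Process(target=start_crawlers, args=(urls_batch,))
        p.start()
        p.join()

Because join() is called right after start(), the batches run one after another, which keeps resource usage bounded; if you want batches to run in parallel, drop the join (or use a process pool) at the cost of more concurrent crawlers.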
Answered By - zaki98