Issue
I have a script set up like below:
try:
    from Xinhua import Xinhua
except:
    error_message("Xinhua")

try:
    from China_Daily import China_Daily
except:
    error_message("China Daily")

try:
    from Global_Times import Global_Times
except:
    error_message("Global Times")

try:
    from Peoples_Daily import Peoples_Daily
except:
    error_message("People's Daily")
The purpose is to run a Scrapy crawler for each site, process the results, and upload those results to a database. When I run each script individually, every portion works fine. When I run them from the block of code above, however, only the first Scrapy crawler actually works properly. All of the subsequent ones attempt to access the sites they are supposed to but don't return any results. I don't even get proper error messages back, just lines like "DEBUG... 200 None" and "[scrapy.crawler] INFO: Overridden settings: {}". I also don't think my IP is being blocked or anything like that; as soon as the crawlers fail, I can immediately launch them individually and they work fine.
My guess is that the first crawler leaves some settings behind that interfere with the subsequent ones, but I haven't been able to find anything. I can rearrange the order of execution, and it is always the first in line that works while the rest fail.
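For context, each site module boils down to the pattern sketched below. This is a reconstruction with illustrative names, since the module contents aren't shown here; the assumption is that every module builds its own CrawlerProcess and calls start(), which would explain why only the first crawl in a shared process behaves.
# Sketch of one per-site module (a reconstruction, not the actual code;
# the import path and file name are hypothetical).
from scrapy.crawler import CrawlerProcess

from xinhua_spiders import XinhuaSpider  # hypothetical import

def Xinhua():
    process = CrawlerProcess(settings={
        "FEEDS": {"xh_crawl_results.json": {"format": "json", "overwrite": True}},
    })
    process.crawl(XinhuaSpider)
    # start() runs the Twisted reactor and stops it when the crawl finishes;
    # the reactor cannot be started a second time in the same process.
    process.start()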
Any thoughts?
Solution
I fixed the issue by combining the crawlers into one script and running them sequentially with CrawlerRunner. The likely root cause of the original failure is that CrawlerProcess starts and stops the Twisted reactor, which can only run once per process, so after the first crawl finishes the later ones never get a working reactor. With CrawlerRunner you manage the reactor yourself and chain the crawls:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()

# xh_crawl_results, cd_crawl_results, gt_crawl_results and pd_crawl_results
# are the output file paths, defined earlier in the script.
spider_settings = [
    {"FEEDS": {xh_crawl_results: {"format": "json", "overwrite": True}}},
    {"FEEDS": {cd_crawl_results: {"format": "json", "overwrite": True}}},
    {"FEEDS": {gt_crawl_results: {"format": "json", "overwrite": True}}},
    {"FEEDS": {pd_crawl_results: {"format": "json", "overwrite": True}}},
]

# One runner per spider, so each spider gets its own feed-export settings.
process_xh = CrawlerRunner(spider_settings[0])
process_cd = CrawlerRunner(spider_settings[1])
process_gt = CrawlerRunner(spider_settings[2])
process_pd = CrawlerRunner(spider_settings[3])

# XinhuaSpider, ChinaDailySpider, GlobalTimesSpider and PeoplesDailySpider
# are defined (or imported) earlier in this script.
@defer.inlineCallbacks
def crawl():
    # Each crawl() call returns a Deferred, so every yield waits for the
    # previous spider to finish before starting the next one.
    yield process_xh.crawl(XinhuaSpider)
    yield process_cd.crawl(ChinaDailySpider)
    yield process_gt.crawl(GlobalTimesSpider)
    yield process_pd.crawl(PeoplesDailySpider)
    reactor.stop()  # shut the reactor down after the last spider finishes

print("Scraping started.")
crawl()
reactor.run()  # blocks until reactor.stop() is called above
print("Scraping completed.")
Answered By - KCpremo