Issue
First, my highest appreciation for all of your work answering noob questions like this one.
Second, since this seems to be a fairly common problem, I found (IMO) related questions such as: Scrapy: Wait for a specific url to be parsed before parsing others.
However, at my current level of understanding it is not straightforward to adapt the suggestions to my specific case, and I would really appreciate your help.
Problem Outline (running on Python 3.7.1, Scrapy 1.5.1):
I want to scrape data from every link collected on pages like https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1, and then from all links on another collection page, e.g. https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650.
I manage to get the desired information (only two elements shown here) if I run the spider for one page (e.g. page 1 or 650) at a time. (Note that I restricted the number of links crawled per page to 2.) However, once I have multiple start_urls (setting two elements in the list, [1,650], in the code below), the parsed data is no longer consistent. Apparently at least one element is not found by the xpath. I suspect some (or a lot of) incorrect logic in how I handle/pass the requests, which results in an unintended parsing order.
Code:
import scrapy
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            print('#### START REQUESTS: ', url)
            yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self, response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ', link)
            abs_link = 'https://www.gipfelbuch.ch/' + link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)

    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')
        print('#### PARSER OUTPUT: ')
        key = [route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value = [route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key, value))
        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])
        print('Length of dict extracted from Route: {}'.format(len(route)))
        return
Command prompt
2019-03-18 15:42:27 [scrapy.core.engine] INFO: Spider opened
2019-03-18 15:42:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-18 15:42:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
#### START REQUESTS: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1
2019-03-18 15:42:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1> (referer: None)
#### START REQUESTS: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650
##### PARSING: /gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort
##### PARSING: /gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn
2019-03-18 15:42:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650> (referer: None)
##### PARSING: /gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue
##### PARSING: /gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule
2019-03-18 15:42:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101557/Skitour_Snowboardtour/Blinnenhorn> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route: Blinnenhorn/Corno Cieco
Comments: Am Samstag Aufstieg zur Corno Gries Hütte, ca. 2,5h ab All Acqua. Zustieg problemslos auf guter Spur. Zur Verwunderung waren wir die einzigsten auf der Hütte. Danke an Monika für die herzliche Bewirtung...
Length of dict extracted from Route: 27
2019-03-18 15:42:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69021/Schneeschuhtour/Cima_Portule> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route: Cima Portule
Comments: Sehr viel Schnee in dieser Gegend und viel Spirarbeit geleiset, deshalb auch viel Zeit gebraucht.
Length of dict extracted from Route: 19
2019-03-18 15:42:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/69022/Alpine_Wanderung/Schwaendeliflue> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/650)
#### PARSER OUTPUT:
Route: Schwändeliflue
Comments: Wege und Pfade meist schneefrei, da im Gebiet viel Hochmoor ist, z.t. sumpfig. Oberhalb 1600m und in Schattenlagen bis 1400m etwas Schnee (max.Schuhtief). Wetter sonnig und sehr warm für die Jahreszeit, T-Shirt - Wetter, Frühlingshaft....
Length of dict extracted from Route: 17
2019-03-18 15:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
#### PARSER OUTPUT:
Route: Beaufort
2019-03-18 15:42:40 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.gipfelbuch.ch//gipfelbuch/detail/id/101559/Skitour_Snowboardtour/Beaufort> (referer: https://www.gipfelbuch.ch/gipfelbuch/touren/seite/1)
Traceback (most recent call last):
File "C:\Users\Lenovo\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\Lenovo\Dropbox\Code\avalanche\scrapy\slf1\slf1\spiders\slf_spider1.py", line 38, in parse_gipfelbuch_item
print('Comments: ', fields['Verhältnis-Beschreibung'])
KeyError: 'Verhältnis-Beschreibung'
2019-03-18 15:42:40 [scrapy.core.engine] INFO: Closing spider (finished)
Question: How do I have to structure the first (for links) and second (for content) parse callbacks correctly? Why is the "#### PARSER OUTPUT" not in the order I would expect (first page 1, links top to bottom, then page 650, links top to bottom)?
I already tried reducing CONCURRENT_REQUESTS to 1 and setting DOWNLOAD_DELAY to 2.
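For reference, both of these settings can be set together in the spider's custom_settings, something like this (with the values I tried):

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }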
I hope the question is clear enough... big thanks in advance.
Solution
If the problem is that multiple URLs are visited at the same time, you can visit them one by one using the spider_idle signal (https://docs.scrapy.org/en/latest/topics/signals.html).
The idea is the following:
1. start_requests only visits the first URL
2. when the spider gets idle, the spider_idle method is called
3. the spider_idle method deletes the first URL and visits the second URL
4. and so on...
The code would be something like this (I didn't try it):
import scrapy
from scrapy import signals
from scrapy.http import Request
from scrapy.spiders import CrawlSpider


class SlfSpider1Spider(CrawlSpider):
    name = 'slf_spider1'
    custom_settings = { 'CONCURRENT_REQUESTS': '1' }
    allowed_domains = ['gipfelbuch.ch']
    start_urls = ['https://www.gipfelbuch.ch/gipfelbuch/touren/seite/'+str(i) for i in [1,650]]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SlfSpider1Spider, cls).from_crawler(crawler, *args, **kwargs)
        # Here you set which method the spider has to run when it gets idle
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    # Method which starts the requests by visiting only the first URL in start_urls
    def start_requests(self):
        # the spider visits only the first provided URL
        url = self.start_urls[0]
        print('#### START REQUESTS: ', url)
        yield scrapy.Request(url, callback=self.parse_verhaeltnisse, dont_filter=True)

    def parse_verhaeltnisse(self, response):
        links = response.xpath('//td//@href').extract()
        for link in links[0:2]:
            print('##### PARSING: ', link)
            abs_link = 'https://www.gipfelbuch.ch/' + link
            yield scrapy.Request(abs_link, callback=self.parse_gipfelbuch_item, dont_filter=True)

    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]//div[@class="togglebox cont_item mt"]//div[@class="label_container"]')
        print('#### PARSER OUTPUT: ')
        key = [route[i].xpath('string(./label)').extract()[0] for i in range(len(route))]
        value = [route[i].xpath('string(div[@class="label_content"])').extract()[0] for i in range(len(route))]
        fields = dict(zip(key, value))
        print('Route: ', fields['Gipfelname'])
        print('Comments: ', fields['Verhältnis-Beschreibung'])
        print('Length of dict extracted from Route: {}'.format(len(route)))
        return

    # When the spider gets idle, it deletes the first URL and visits the next one, and so on...
    def spider_idle(self, spider):
        del self.start_urls[0]
        if len(self.start_urls) > 0:
            url = self.start_urls[0]
            self.crawler.engine.crawl(Request(url, callback=self.parse_verhaeltnisse, dont_filter=True), spider)
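Side note: independent of the request order, the traceback in the question shows that parse_gipfelbuch_item raises a KeyError when a detail page has no 'Verhältnis-Beschreibung' label. A minimal, untested sketch of a more defensive version of that method (same class and XPaths as above; the 'n/a' fallback string is just a placeholder I chose):

    def parse_gipfelbuch_item(self, response):
        route = response.xpath('/html/body/main/div[4]/div[@class="col_f"]'
                               '//div[@class="togglebox cont_item mt"]'
                               '//div[@class="label_container"]')
        print('#### PARSER OUTPUT: ')
        # extract_first('') returns an empty string instead of raising IndexError
        key = [r.xpath('string(./label)').extract_first('') for r in route]
        value = [r.xpath('string(div[@class="label_content"])').extract_first('') for r in route]
        fields = dict(zip(key, value))
        # dict.get returns the fallback instead of raising KeyError
        # when a page lacks one of the labels
        print('Route: ', fields.get('Gipfelname', 'n/a'))
        print('Comments: ', fields.get('Verhältnis-Beschreibung', 'n/a'))
        print('Length of dict extracted from Route: {}'.format(len(route)))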
Answered By - Stefano Fiorucci - anakin87