Thursday, March 24, 2022

[FIXED] Scrapy rows output is in the wrong order

Tags: python, python-3.x, scrapy, web-scraping

Issue

How can I manipulate the ORDER of the ROWS - or simply output them in the same order as they appear on the website?

(I cannot get the rows in the .csv file to follow the order in which they appear on the website.)

I have managed to arrange the columns with FEED_EXPORT_FIELDS (via settings.py):

FEED_EXPORT_FIELDS = ["brandname", "devicecount", "phonename"]

However, every attempt to sort the rows has been unsuccessful.

This is the code:

import scrapy
from gsm.items import GsmItem


class GsmSpider(scrapy.Spider):
    name = 'gsm'
    allowed_domains = ['gsmarena.com']
    start_urls = ['https://gsmarena.com/makers.php3']

    # LEVEL 1 | all brands

    def parse(self, response):
        gsms = response.xpath('//div[@class="st-text"]/table//td')
        for gsm in gsms:
            allbranddevicesurl = gsm.xpath('.//a/@href').get()

            # Create one item per brand; a single shared item would be
            # overwritten by later loop iterations before the callbacks run.
            item = GsmItem()
            item['brandname'] = gsm.xpath('.//a/text()').get()
            item['devicecount'] = gsm.xpath('.//span/text()').get()

            yield response.follow(allbranddevicesurl,
                                  callback=self.parse_allbranddevicesurl,
                                  meta={'item': item})

    # LEVEL 2 | all devices

    def parse_allbranddevicesurl(self, response):
        item = response.meta['item']

        phones = response.xpath('//*[@id="review-body"]//li')
        for phone in phones:
            detailpageurl = phone.xpath('.//a/@href').get()

            # Pass a copy so that every detail page fills in its own item.
            yield response.follow(detailpageurl,
                                  callback=self.parse_detailpage,
                                  meta={'item': item.copy()})

        next_page = response.xpath('//a[@class="pages-next"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page,
                                  callback=self.parse_allbranddevicesurl,
                                  meta={'item': item})

    # LEVEL 3 | detailpage

    def parse_detailpage(self, response):
        item = response.meta['item']

        details = response.xpath('//div[@class="article-info"]')
        for detail in details:
            # Store the phone name on the item (assumes GsmItem declares a phonename field).
            item['phonename'] = detail.xpath('.//h1/text()').get()
            yield item

I would be grateful for a suggestion on how to solve this problem.

Solution

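The solution is to add the following settings to settings.py. DEPTH_PRIORITY = 1 together with the two FIFO queues switches Scrapy from its default depth-first (LIFO) order to breadth-first order, and CONCURRENT_REQUESTS = 1 makes Scrapy handle one request at a time, which enforces that order at the cost of a noticeably slower crawl: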

DEPTH_PRIORITY = 1
CONCURRENT_REQUESTS = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
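
To see why the queue type matters, the following toy model may help. It is only a sketch and not Scrapy's actual scheduler: the SITE dictionary and the crawl function are hypothetical, and the model processes exactly one page at a time (the situation that CONCURRENT_REQUESTS = 1 approximates). With a LIFO queue the most recently discovered link is visited first, so the leaf pages come out reversed; with a FIFO queue they come out in the order they were discovered.

```python
from collections import deque

# Hypothetical two-brand site: a makers page links to brands, brands link to devices.
SITE = {
    "makers": ["brand-A", "brand-B"],
    "brand-A": ["A1", "A2"],
    "brand-B": ["B1", "B2"],
}


def crawl(start, fifo):
    """Visit pages one at a time and return the leaf pages in the order reached."""
    pending = deque([start])
    visited_leaves = []
    while pending:
        # FIFO takes the oldest pending request, LIFO the newest.
        page = pending.popleft() if fifo else pending.pop()
        links = SITE.get(page)
        if links is None:
            # A detail page: this is where a row would be emitted.
            visited_leaves.append(page)
        else:
            pending.extend(links)
    return visited_leaves


lifo_order = crawl("makers", fifo=False)
fifo_order = crawl("makers", fifo=True)
print(lifo_order)  # ['B2', 'B1', 'A2', 'A1']
print(fifo_order)  # ['A1', 'A2', 'B1', 'B2']
```
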

The same approach is also described in scrapy-spider-output-in-chronological-order and in does-scrapy-crawl-in-breadth-first-or-depth-first-order.
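
If a single-request crawl is too slow, a different technique (not the one from the answer above) is to let Scrapy crawl in any order and sort the exported rows afterwards. The sketch below assumes the spider has been extended to store each entry's position on the page in two extra fields; brand_pos, phone_pos, the sample data and the sort_rows helper are all hypothetical names used only for illustration.

```python
import csv
import io

# Hypothetical export: brand_pos and phone_pos are assumed extra fields that the
# spider would fill with each entry's position on the page (e.g. via enumerate).
raw_csv = """brandname,devicecount,phonename,brand_pos,phone_pos
Beta,2 devices,Beta Two,1,1
Alpha,2 devices,Alpha One,0,0
Beta,2 devices,Beta One,1,0
Alpha,2 devices,Alpha Two,0,1
"""


def sort_rows(csv_text):
    """Return the rows sorted by (brand_pos, phone_pos), i.e. in page order."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Positions are read as strings, so convert them to int before comparing.
    rows.sort(key=lambda r: (int(r["brand_pos"]), int(r["phone_pos"])))
    return rows


sorted_rows = sort_rows(raw_csv)
for row in sorted_rows:
    print(row["brandname"], row["phonename"])
```

The trade-off is that the crawl can keep its normal concurrency, but the ordering then depends on the position fields being recorded correctly at every level.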

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
