Issue
How can I control the order of the rows in the exported .csv file, or simply output them in the same order as they appear on the website?
I have managed to arrange the columns with FEED_EXPORT_FIELDS (via settings.py):
FEED_EXPORT_FIELDS = ["brandname", "devicecount", "phonename"]
However, every attempt to sort the rows has been unsuccessful.
This is the code:
import scrapy
from gsm.items import GsmItem


class GsmSpider(scrapy.Spider):
    name = 'gsm'
    allowed_domains = ['gsmarena.com']
    start_urls = ['https://gsmarena.com/makers.php3']

    # LEVEL 1 | all brands
    def parse(self, response):
        gsms = response.xpath('//div[@class="st-text"]/table//td')
        for gsm in gsms:
            allbranddevicesurl = gsm.xpath('.//a/@href').get()
            brandname = gsm.xpath('.//a/text()').get()
            devicecount = gsm.xpath('.//span/text()').get()
            # Create a fresh item per brand so later brands do not overwrite earlier ones
            item = GsmItem()
            item['brandname'] = brandname
            item['devicecount'] = devicecount
            yield response.follow(allbranddevicesurl,
                                  callback=self.parse_allbranddevicesurl,
                                  meta={'item': item})

    # LEVEL 2 | all devices
    def parse_allbranddevicesurl(self, response):
        item = response.meta['item']
        phones = response.xpath('//*[@id="review-body"]//li')
        for phone in phones:
            detailpageurl = phone.xpath('.//a/@href').get()
            yield response.follow(detailpageurl,
                                  callback=self.parse_detailpage,
                                  meta={'item': item})
        next_page = response.xpath('//a[@class="pages-next"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page,
                                  callback=self.parse_allbranddevicesurl,
                                  meta={'item': item})

    # LEVEL 3 | detailpage
    def parse_detailpage(self, response):
        details = response.xpath('//div[@class="article-info"]')
        for detail in details:
            # Copy the shared brand data so each phone gets its own row
            item = response.meta['item'].copy()
            item['phonename'] = detail.xpath('.//h1/text()').get()
            yield item
I would be grateful for a suggestion on how to solve this problem.
Solution
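By default, Scrapy stores pending requests in LIFO queues, so it crawls in depth-first order and downloads several pages concurrently, which is why the rows do not follow the order on the website. The solution is to switch the scheduler to FIFO queues (breadth-first order) and to process one request at a time by adding the following settings to settings.py. Note that CONCURRENT_REQUESTS = 1 makes the crawl considerably slower.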
DEPTH_PRIORITY = 1
CONCURRENT_REQUESTS = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
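To see why the queue type matters, here is a minimal sketch of a crawler that handles one request at a time. It is a toy model and not Scrapy's actual scheduler: the SITE dictionary, the page names, and the crawl function are hypothetical and exist only to illustrate the effect of FIFO versus LIFO ordering.

```python
from collections import deque

# Hypothetical two-level site: the start page links to brand pages,
# and each brand page links to device pages (all names are made up).
SITE = {
    "start": ["brand-A", "brand-B"],
    "brand-A": ["A1", "A2"],
    "brand-B": ["B1", "B2"],
}


def crawl(fifo):
    """Return the device pages in the order a one-request-at-a-time crawler visits them."""
    pending = deque(["start"])
    visited_devices = []
    while pending:
        # FIFO takes the oldest pending request, LIFO takes the newest one.
        page = pending.popleft() if fifo else pending.pop()
        links = SITE.get(page)
        if links is None:
            # A page without outgoing links is a device page, i.e. one output row.
            visited_devices.append(page)
        else:
            pending.extend(links)
    return visited_devices


print(crawl(fifo=True))   # ['A1', 'A2', 'B1', 'B2']
print(crawl(fifo=False))  # ['B2', 'B1', 'A2', 'A1']
```

With the FIFO queue the devices come out in the order the links appear on the pages, while the LIFO queue reverses them. In a real crawl, concurrency adds further reordering, which is presumably why the answer also sets CONCURRENT_REQUESTS = 1.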
The solution is also mentioned here:
scrapy-spider-output-in-chronological-order
does-scrapy-crawl-in-breadth-first-or-depth-first-order
Answered By - Alexander