Issue
I'm using Scrapy to scrape my website for four columns (stock quantity, name, price, URL). I'd like the output file to be sorted alphabetically by the name column. I can open the CSV and sort it manually, but surely some wizard knows a way to do this in the script?
Code:
import scrapy
from scrapy.crawler import CrawlerProcess
import csv

# Open the CSV file up front and share one writer across all requests
cs = open('results/2x2_results.csv', 'w', newline='', encoding='utf-8')
header_names = ['stk', 'name', 'price', 'url']
csv_writer = csv.DictWriter(cs, fieldnames=header_names)
csv_writer.writeheader()

class SCXX(scrapy.Spider):
    name = 'SCXX'
    start_urls = [
        'https://website.com'
    ]

    def parse(self, response):
        # Follow every product link on the listing page
        product_urls = response.css('div.grid-uniform a.product-grid-item::attr(href)').extract()
        for product_url in product_urls:
            yield scrapy.Request(url='https://website.com' + product_url, callback=self.next_parse_two)
        # Follow the pagination link, if there is one
        next_url = response.css('ul.pagination-custom li a[title="Next »"]::attr(href)').get()
        if next_url is not None:
            yield scrapy.Request(url='https://website.com' + next_url, callback=self.parse)

    def next_parse_two(self, response):
        item = dict()
        item['stk'] = response.css('script#swym-snippet::text').get().split('stk:')[1].split(',')[0]
        item['name'] = response.css('h1.h2::text').get()
        item['price'] = response.css('span#productPrice-product-template span.visually-hidden::text').get()
        item['url'] = response.url
        # Rows are written as responses arrive, so the file ends up unsorted
        csv_writer.writerow(item)
        cs.flush()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(SCXX)
process.start()
Solution
Scrapy works asynchronously, so requests are processed in no particular order. Imagine a group of workers: some bring back apples, some bananas, some oranges. You could tell each worker to drop their fruit into the right basket as they arrive (that is, insert each item into its sorted position), but in code that's more hassle than it's worth; I'd propose simply collecting the data and calling sort() on it afterwards.
The rows aren't written in any particular order: all the requests are launched at once and the results are written on the fly. What you can do is run a post-scrape step that sorts the output at the end. That's probably the best approach.
import json

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},  # let Scrapy handle the export itself
    },
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

# Load the exported JSON (a list of item dicts), sort it, and write it back out.
with open('items.json') as file:
    data = json.load(file)
data.sort(key=lambda item: item['name'])  # sort alphabetically on a specific key, here 'name'
with open('output.json', 'w') as outfile:
    json.dump(data, outfile)
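Since your spider writes CSV rather than JSON, the same post-scrape idea works on the CSV directly. A minimal sketch, reusing the file path and 'name' column from your script (the sorted copy's filename is my own choice):

import csv

# Read every row back in, sort on the 'name' column, and write a sorted copy
with open('results/2x2_results.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

rows.sort(key=lambda row: (row['name'] or '').lower())  # case-insensitive alphabetical order

with open('results/2x2_results_sorted.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['stk', 'name', 'price', 'url'])
    writer.writeheader()
    writer.writerows(rows)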
Additional notes
Ideally we would write the rows to an in-memory stream (Python's io module) instead of straight to a file, but writing to a file and then running the sort on that file is easier to follow, which is why I suggest it here.
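For completeness, a minimal sketch of that in-memory variant, again assuming your four fields; nothing touches the disk until the rows have been sorted (the two example rows are made up):

import csv
import io

fields = ['stk', 'name', 'price', 'url']
buffer = io.StringIO()  # in-memory text stream instead of a file on disk
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()

# The spider would call writer.writerow(item) for each scraped item;
# two made-up rows stand in for real data here.
writer.writerow({'stk': '3', 'name': 'Widget B', 'price': '$5.00', 'url': 'https://website.com/b'})
writer.writerow({'stk': '7', 'name': 'Widget A', 'price': '$4.00', 'url': 'https://website.com/a'})

# Rewind the buffer, sort the rows by name, then write the file once at the end
buffer.seek(0)
rows = sorted(csv.DictReader(buffer), key=lambda row: row['name'])
with open('results/2x2_results.csv', 'w', newline='', encoding='utf-8') as f:
    out = csv.DictWriter(f, fieldnames=fields)
    out.writeheader()
    out.writerows(rows)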
Let me know if you have any questions
Answered By - innicoder