Issue
I am using Scrapy and running this script:
import scrapy
from ..items import SizeerItem
from scrapy.http.request import Request

class SizeerSpiderSpider(scrapy.Spider):
    name = 'sizeer'
    pg = 0
    currentPg = 2
    start_urls = [
        'https://sizeer.lt/moterims'
    ]

    def parse(self, response):
        items = SizeerItem()
        pages = response.xpath("//nav[@class='m-pagination']//span[3]/text()").extract()
        pages = list(dict.fromkeys(pages))
        if self.pg == 0:
            pages = list(int(s) for s in pages[0].split() if s.isdigit())
            self.pg = pages[0]
        name = response.xpath("//div[@class='b-productList_content']//a/@href").extract()
        items['name'] = list(dict.fromkeys(name))
        while self.currentPg <= self.pg:
            url = response.request.url + "?sort=default&limit=60&page=" + str(self.currentPg)
            self.currentPg += 1
            yield Request(url, callback=self.parse)
I run it like this:
scrapy crawl sizeer -s FEED_URI='mydata.json' -s FEED_FORMAT=json
But afterwards my mydata.json is empty. This is my first time trying to 'play' with Scrapy and I can't really understand where the issue is.
Solution
You also need to yield the items you scrape, so the Scrapy engine will run them through the pipelines and through the Feed Export (which is what writes them to the file). Since yield is non-blocking, you can yield the item right after populating it and the function will still yield your requests afterwards:
...
name = response.xpath("//div[@class='b-productList_content']//a/@href").extract()
items['name'] = list(dict.fromkeys(name))
yield items  # <<< Here, for example
while self.currentPg <= self.pg:
    ...
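Put together, the spider with that one-line fix would look like this (your code unchanged apart from the added yield):

import scrapy
from ..items import SizeerItem
from scrapy.http.request import Request

class SizeerSpiderSpider(scrapy.Spider):
    name = 'sizeer'
    pg = 0
    currentPg = 2
    start_urls = [
        'https://sizeer.lt/moterims'
    ]

    def parse(self, response):
        items = SizeerItem()
        pages = response.xpath("//nav[@class='m-pagination']//span[3]/text()").extract()
        pages = list(dict.fromkeys(pages))
        if self.pg == 0:
            pages = list(int(s) for s in pages[0].split() if s.isdigit())
            self.pg = pages[0]
        name = response.xpath("//div[@class='b-productList_content']//a/@href").extract()
        items['name'] = list(dict.fromkeys(name))
        yield items  # items now reach the pipelines and the Feed Export
        while self.currentPg <= self.pg:
            url = response.request.url + "?sort=default&limit=60&page=" + str(self.currentPg)
            self.currentPg += 1
            yield Request(url, callback=self.parse)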
As @yordan pointed out, you can simplify the way you are executing the spider like this (however, it's not the solution to the problem):
scrapy crawl sizeer -o mydata.json
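One more thing worth checking: assigning items['name'] raises a KeyError unless that field is declared on the item class. A minimal sketch of what your items.py would need (the name field is assumed from your spider; adjust if yours differs):

import scrapy

class SizeerItem(scrapy.Item):
    # Declare every field the spider assigns; undeclared fields raise KeyError
    name = scrapy.Field()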
Answered By - renatodvc