Issue
I've written a script using scrapy to parse some content from a website. To access the relevant part of the data from that site I need to send some ids within the payload.
I'm trying these three ids within the payload one after another: 24842, 19902 and 20154. However, when I check whether the payload is updated according to the id I've used, I notice that the payload takes the last id, 20154, all three times.
To be clearer, I wanted to get output like:
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/24842/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/19902/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
What I'm getting instead is:
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
This is what I've tried:
import scrapy
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess

class SiemensSpider(scrapy.Spider):
    name = 'siemens'
    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        '$top': '20'
    }

    def start_requests(self):
        for item_id in ['24842', '19902', '20154']:
            self.payload['productNodePath'] = f"/{item_id}/"  # should be updated here
            first_req = f'{self.start_link}{urlencode(self.payload)}'
            yield scrapy.Request(first_req, callback=self.parse)

    def parse(self, response):
        print(self.payload)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        'LOG_LEVEL': 'ERROR'
    })
    c.crawl(SiemensSpider)
    c.start()
How can I achieve the first output?
Solution
You have to use deepcopy in this case, combined with meta data on the request.
You are taking a property of the spider and just overwriting one of its values, so every request shares the same dict object. Because Scrapy schedules requests asynchronously, the loop has already finished and left the last id in place by the time any callback runs, which is why you see the repeated data. Copying the payload for each request and passing that copy along in meta gives every callback its own snapshot.
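The effect is easy to reproduce outside Scrapy. This minimal sketch (plain Python, no Scrapy involved) shows that storing the same dict three times keeps only the last write:

shared = {'$top': '20'}
captured = []
for item_id in ['24842', '19902', '20154']:
    shared['productNodePath'] = f"/{item_id}/"
    captured.append(shared)  # every entry is a reference to the same dict
print(captured[0] is captured[2])      # True: one object, three references
print(captured[0]['productNodePath'])  # '/20154/', the last write wins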
Working example:
import scrapy
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess
from copy import deepcopy

class SiemensSpider(scrapy.Spider):
    name = 'siemens'
    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        '$top': '20'
    }

    def start_requests(self):
        for item_id in ['24842', '19902', '20154']:
            data = deepcopy(self.payload)
            data['productNodePath'] = f"/{item_id}/"  # should be updated here
            first_req = f'{self.start_link}{urlencode(data)}'
            yield scrapy.Request(first_req, callback=self.parse, meta={'payload': data})

    def parse(self, response):
        payload = response.meta.get('payload')
        print(payload)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        'LOG_LEVEL': 'ERROR'
    })
    c.crawl(SiemensSpider)
    c.start()
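As a side note: since the payload values here are plain strings, a shallow copy would also be enough, and Scrapy 1.7+ offers cb_kwargs as the recommended way to hand data to a callback. The following is an alternative sketch along those lines, written as drop-in replacements for start_requests and parse in the spider above (same class attributes assumed):

    def start_requests(self):
        for item_id in ['24842', '19902', '20154']:
            data = {**self.payload, 'productNodePath': f"/{item_id}/"}  # fresh dict per request
            first_req = f'{self.start_link}{urlencode(data)}'
            yield scrapy.Request(first_req, callback=self.parse,
                                 cb_kwargs={'payload': data})

    def parse(self, response, payload):  # cb_kwargs arrive as keyword arguments
        print(payload)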
Answered By - Ryan