Issue
I'm trying to parse different ids from a JSON response using Scrapy, but I can't get it to work, whereas I succeed using the requests module. I'm trying to get the ids of different articles from this website. The ids look like 1397099, 539728, etc., which the requests version fetches flawlessly.
Using requests (succeeded):
import json
import requests

link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

payload = {
    'language': 'en',
    'region': 'ww',
    'networks': 'Internet',
    'productNodePath': '/13204/',
    '$top': '20'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(link, params=payload)
    for item in r.json()['Products']:
        print(item['Id'])
Using scrapy (failed):
import scrapy
import json
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess

class SiemensSpider(scrapy.Spider):
    name = 'siemens'

    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        'productNodePath': '/13204/',
        '$top': '20'
    }

    def start_requests(self):
        first_req = f'{self.start_link}{urlencode(self.payload)}'
        yield scrapy.Request(first_req, callback=self.parse)

    def parse(self, response):
        for item in json.loads(response.body_as_unicode())['Products']:
            print(item['Id'])

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(SiemensSpider)
    c.start()
When I run the Scrapy code, the response I get is not JSON, so the script fails with json.decoder.JSONDecodeError.
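Replacing the body of parse() with a quick inspection shows what the server actually sends back (a minimal debugging sketch; nothing here is specific to this API):

def parse(self, response):
    # Debugging: check the content type and the start of the body
    print(response.headers.get('Content-Type'))
    print(response.text[:200])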
How can I parse the ids from the JSON response using Scrapy?
Solution
It seems to be an issue with the headers.
I opened one of the links in a browser and saw that XML was being returned.
I modified the spider's headers so that it requests JSON, and it worked as expected:
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
}

def start_requests(self):
    first_req = f'{self.start_link}{urlencode(self.payload)}'
    yield scrapy.Request(first_req, callback=self.parse, headers=self.headers)
Full code:
import scrapy
import json
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess

class SiemensSpider(scrapy.Spider):
    name = 'siemens'

    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        'productNodePath': '/13204/',
        '$top': '20'
    }

    # Ask the API for JSON explicitly; without this it replies with XML
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    }

    def start_requests(self):
        first_req = f'{self.start_link}{urlencode(self.payload)}'
        yield scrapy.Request(first_req, callback=self.parse, headers=self.headers)

    def parse(self, response):
        for item in json.loads(response.text)['Products']:
            print(item['Id'])

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(SiemensSpider)
    c.start()
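As an aside, on Scrapy 2.2 or newer the parse callback can use the built-in response.json() shortcut, which removes the need for the json import (a small variant, assuming a recent Scrapy version):

def parse(self, response):
    # Scrapy 2.2+ parses the JSON body for you via response.json()
    for item in response.json()['Products']:
        print(item['Id'])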
I imagine that the requests library must request JSON by default.
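For what it's worth, the two clients advertise different defaults: requests sends Accept: */* and leaves the choice of format to the server, while Scrapy's default headers explicitly prefer HTML/XML, which would explain the XML responses above (a quick comparison, assuming recent versions of both libraries):

import requests
from scrapy.settings.default_settings import DEFAULT_REQUEST_HEADERS

# requests leaves the choice of representation to the server
print(requests.utils.default_headers()['Accept'])  # */*

# Scrapy announces a preference for HTML/XML out of the box
print(DEFAULT_REQUEST_HEADERS['Accept'])
# text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8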
Answered By - Ryan