Issue
I'm trying to parse different ids from a JSON response using Scrapy, but I can't get it to work, whereas I succeed using the requests module. I'm trying to get the ids of different articles from this website. The ids look like 1397099, 539728, etc., which the requests version fetches flawlessly.
Using requests (succeeded):
import json
import requests

link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

payload = {
    'language': 'en',
    'region': 'ww',
    'networks': 'Internet',
    'productNodePath': '/13204/',
    '$top': '20'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(link, params=payload)
    for item in r.json()['Products']:
        print(item['Id'])
Using scrapy (failed):
import scrapy
import json
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess

class SiemensSpider(scrapy.Spider):
    name = 'siemens'

    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        'productNodePath': '/13204/',
        '$top': '20'
    }

    def start_requests(self):
        first_req = f'{self.start_link}{urlencode(self.payload)}'
        yield scrapy.Request(first_req, callback=self.parse)

    def parse(self, response):
        for item in json.loads(response.body_as_unicode())['Products']:
            print(item['Id'])

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(SiemensSpider)
    c.start()
When I run the Scrapy code, the response I get is not JSON, so the script fails with json.decoder.JSONDecodeError.
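Replacing the body of parse() with a quick inspection shows what the server actually sends back (a minimal debugging sketch; nothing here is specific to this API):

def parse(self, response):
    # Debugging: check the content type and the start of the body
    print(response.headers.get('Content-Type'))
    print(response.text[:200])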
How can I parse the ids from the JSON response using Scrapy?
Solution
It seems to be an issue with the headers.
I opened one of the links in a browser and saw that XML was being returned.
I modified the spider's headers so that it requests JSON, and it worked as expected:
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
}

def start_requests(self):
    first_req = f'{self.start_link}{urlencode(self.payload)}'
    yield scrapy.Request(first_req, callback=self.parse, headers=self.headers)
Full code:
import scrapy
import json
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess

class SiemensSpider(scrapy.Spider):
    name = 'siemens'

    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        'productNodePath': '/13204/',
        '$top': '20'
    }

    # Ask the API for JSON explicitly; without this it replies with XML
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    }

    def start_requests(self):
        first_req = f'{self.start_link}{urlencode(self.payload)}'
        yield scrapy.Request(first_req, callback=self.parse, headers=self.headers)

    def parse(self, response):
        for item in json.loads(response.text)['Products']:
            print(item['Id'])

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    })
    c.crawl(SiemensSpider)
    c.start()
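As an aside, on Scrapy 2.2 or newer the parse callback can use the built-in response.json() shortcut, which removes the need for the json import (a small variant, assuming a recent Scrapy version):

def parse(self, response):
    # Scrapy 2.2+ parses the JSON body for you via response.json()
    for item in response.json()['Products']:
        print(item['Id'])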
I imagine that the requests library must request JSON by default.
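For what it's worth, the two clients advertise different defaults: requests sends Accept: */* and leaves the choice of format to the server, while Scrapy's default headers explicitly prefer HTML/XML, which would explain the XML responses above (a quick comparison, assuming recent versions of both libraries):

import requests
from scrapy.settings.default_settings import DEFAULT_REQUEST_HEADERS

# requests leaves the choice of representation to the server
print(requests.utils.default_headers()['Accept'])  # */*

# Scrapy announces a preference for HTML/XML out of the box
print(DEFAULT_REQUEST_HEADERS['Accept'])
# text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8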
Answered By - Ryan