Issue
I've written a script using scrapy to parse some content from a website. To access the relevant part of the data from that site I need to send some ids within the payload.
I'm trying these three ids within the payload one after another: 24842, 19902 and 20154. However, when I check whether the payload is updated according to the id I've used, I notice that the payload takes the last id, 20154, all three times.
To be clearer, I wanted to get output like:
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/24842/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/19902/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
What I'm getting instead is:
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
{'language': 'en', 'region': 'ww', 'networks': 'Internet', '$top': '20', 'productNodePath': '/20154/'}
This is what I've tried:
import scrapy
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess

class SiemensSpider(scrapy.Spider):
    name = 'siemens'
    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        '$top': '20'
    }

    def start_requests(self):
        for item_id in ['24842', '19902', '20154']:
            self.payload['productNodePath'] = f"/{item_id}/"  # should be updated here
            first_req = f'{self.start_link}{urlencode(self.payload)}'
            yield scrapy.Request(first_req, callback=self.parse)

    def parse(self, response):
        print(self.payload)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        'LOG_LEVEL': 'ERROR'
    })
    c.crawl(SiemensSpider)
    c.start()
How can I achieve the first output?
Solution
You have to use deepcopy in this case, combined with meta data on the request.
You are taking a property of the spider and just overwriting one of its values, so every request shares the same dict object. Because Scrapy schedules requests asynchronously, the loop has already finished and left the last id in place by the time any callback runs, which is why you see the repeated data. Copying the payload for each request and passing that copy along in meta gives every callback its own snapshot.
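The effect is easy to reproduce outside Scrapy. This minimal sketch (plain Python, no Scrapy involved) shows that storing the same dict three times keeps only the last write:

shared = {'$top': '20'}
captured = []
for item_id in ['24842', '19902', '20154']:
    shared['productNodePath'] = f"/{item_id}/"
    captured.append(shared)  # every entry is a reference to the same dict
print(captured[0] is captured[2])      # True: one object, three references
print(captured[0]['productNodePath'])  # '/20154/', the last write wins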
Working example:
import scrapy
from urllib.parse import urlencode
from scrapy.crawler import CrawlerProcess
from copy import deepcopy

class SiemensSpider(scrapy.Spider):
    name = 'siemens'
    start_link = "https://support.industry.siemens.com/webbackend/api/ProductSupport/ProductSearch?"

    payload = {
        'language': 'en',
        'region': 'ww',
        'networks': 'Internet',
        '$top': '20'
    }

    def start_requests(self):
        for item_id in ['24842', '19902', '20154']:
            data = deepcopy(self.payload)
            data['productNodePath'] = f"/{item_id}/"  # should be updated here
            first_req = f'{self.start_link}{urlencode(data)}'
            yield scrapy.Request(first_req, callback=self.parse, meta={'payload': data})

    def parse(self, response):
        payload = response.meta.get('payload')
        print(payload)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
        'LOG_LEVEL': 'ERROR'
    })
    c.crawl(SiemensSpider)
    c.start()
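As a side note: since the payload values here are plain strings, a shallow copy would also be enough, and Scrapy 1.7+ offers cb_kwargs as the recommended way to hand data to a callback. The following is an alternative sketch along those lines, written as drop-in replacements for start_requests and parse in the spider above (same class attributes assumed):

    def start_requests(self):
        for item_id in ['24842', '19902', '20154']:
            data = {**self.payload, 'productNodePath': f"/{item_id}/"}  # fresh dict per request
            first_req = f'{self.start_link}{urlencode(data)}'
            yield scrapy.Request(first_req, callback=self.parse,
                                 cb_kwargs={'payload': data})

    def parse(self, response, payload):  # cb_kwargs arrive as keyword arguments
        print(payload)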
Answered By - Ryan