Issue
I'm having trouble scraping these two specific pages and can't see where the problem is. If you have any ideas or advice, I'm all ears! Thanks in advance!
import scrapy


class SneakersSpider(scrapy.Spider):
    name = "sneakers"

    def start_requests(self):
        headers = {'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
        urls = [
            #"https://stockx.com/fr-fr/retro-jordans",
            "https://stockx.com/fr-fr/retro-jordans?page=2",
            "https://stockx.com/fr-fr/retro-jordans?page=3",
        ]
        for url in urls:
            yield scrapy.Request(url = url, callback =self.parse ,headers = headers)

    def parse(self, response):
        page = response.url.split("=")[-1]
        filename = f'sneakers-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file (unknown)')
Solution
Looking at the traceback always helps. You should see something like this in your spider's output:
Traceback (most recent call last):
File "c:\program files\python37\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "D:\Users\Ivan\Documents\Python\a.py", line 15, in start_requests
yield scrapy.Request(url = url, callback =self.parse ,headers = headers)
File "c:\program files\python37\lib\site-packages\scrapy\http\request\__init__.py", line 39, in __init__
self.headers = Headers(headers or {}, encoding=encoding)
File "c:\program files\python37\lib\site-packages\scrapy\http\headers.py", line 12, in __init__
super(Headers, self).__init__(seq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 193, in __init__
self.update(seq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 229, in update
super(CaselessDict, self).update(iseq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 228, in <genexpr>
iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack (expected 2)
As you can see, there is a problem in the code that handles request headers.
headers is a set in your code; it should be a dict instead.
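To see why the set form blows up, here is a minimal sketch outside Scrapy of what happens internally: iterating a set yields its elements (here, one long string), and trying to unpack each element into a (key, value) pair raises the exact ValueError from the traceback. A dict, iterated via .items(), unpacks cleanly.

```python
# A set containing one string, as in the original code:
headers = {'Mozilla/5.0 (Windows NT 10.0; WOW64)'}

try:
    # Scrapy's Headers class effectively does this with the
    # sequence it is given; unpacking a long string into two
    # names fails.
    normalized = {k: v for k, v in headers}
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# With a dict, iterating .items() yields real (key, value) pairs:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
normalized = {k: v for k, v in headers.items()}
print(normalized['User-Agent'])
```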
This works without a problem:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
Another way to set a default user agent for all requests is to use the USER_AGENT setting.
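For example, a minimal sketch of the settings approach, assuming a standard Scrapy project layout where the project has a settings.py; with this in place, the per-request headers argument can be dropped entirely:

```python
# settings.py (Scrapy project settings)
# USER_AGENT sets the default User-Agent header for every
# request the spiders in this project make.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
)
```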
Answered By - stranac