Issue
I'm having trouble scraping these two specific pages and can't see where the problem is. If you have any ideas or advice, I'm all ears! Thanks in advance!
import scrapy


class SneakersSpider(scrapy.Spider):
    name = "sneakers"

    def start_requests(self):
        headers = {'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
        urls = [
            #"https://stockx.com/fr-fr/retro-jordans",
            "https://stockx.com/fr-fr/retro-jordans?page=2",
            "https://stockx.com/fr-fr/retro-jordans?page=3",
        ]
        for url in urls:
            yield scrapy.Request(url = url, callback =self.parse ,headers = headers)

    def parse(self, response):
        page = response.url.split("=")[-1]
        filename = f'sneakers-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file (unknown)')
Solution
Looking at the traceback always helps. You should see something like this in your spider's output:
Traceback (most recent call last):
File "c:\program files\python37\lib\site-packages\scrapy\core\engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "D:\Users\Ivan\Documents\Python\a.py", line 15, in start_requests
yield scrapy.Request(url = url, callback =self.parse ,headers = headers)
File "c:\program files\python37\lib\site-packages\scrapy\http\request\__init__.py", line 39, in __init__
self.headers = Headers(headers or {}, encoding=encoding)
File "c:\program files\python37\lib\site-packages\scrapy\http\headers.py", line 12, in __init__
super(Headers, self).__init__(seq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 193, in __init__
self.update(seq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 229, in update
super(CaselessDict, self).update(iseq)
File "c:\program files\python37\lib\site-packages\scrapy\utils\datatypes.py", line 228, in <genexpr>
iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack (expected 2)
As you can see, there is a problem in the code that handles request headers.
headers is a set in your code; it should be a dict instead.
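To see why the set form blows up, here is a minimal sketch outside Scrapy of what happens internally: iterating a set yields its elements (here, one long string), and trying to unpack each element into a (key, value) pair raises the exact ValueError from the traceback. A dict, iterated via .items(), unpacks cleanly.

```python
# A set containing one string, as in the original code:
headers = {'Mozilla/5.0 (Windows NT 10.0; WOW64)'}

try:
    # Scrapy's Headers class effectively does this with the
    # sequence it is given; unpacking a long string into two
    # names fails.
    normalized = {k: v for k, v in headers}
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# With a dict, iterating .items() yields real (key, value) pairs:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64)'}
normalized = {k: v for k, v in headers.items()}
print(normalized['User-Agent'])
```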
This works without a problem:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
Another way to set a default user agent for all requests is to use the USER_AGENT setting.
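For example, a minimal sketch of the settings approach, assuming a standard Scrapy project layout where the project has a settings.py; with this in place, the per-request headers argument can be dropped entirely:

```python
# settings.py (Scrapy project settings)
# USER_AGENT sets the default User-Agent header for every
# request the spiders in this project make.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
)
```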
Answered By - stranac