Issue
I am trying to scrape information with Scrapy from a sneaker website for a university project. The idea is to have Scrapy follow the link of each shoe and scrape four data points (name, release_date, retail_price, resell_price), then go back to the listing page, follow the next link and scrape again. At the end of the page, it should move on to the next page and repeat until there are no more links.
However, I always run into an ERROR message when Scrapy processes the given start URL:
2020-04-06 11:59:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)
2020-04-06 11:59:56 [scrapy.core.scraper] ERROR: Spider error processing <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)
Here's the code:
import scrapy


class Spider200406Item(scrapy.Item):
    link = scrapy.Field()
    name = scrapy.Field()
    release_date = scrapy.Field()
    retail_price = scrapy.Field()
    resell_price = scrapy.Field()


class Spider200406Spider(scrapy.Spider):
    name = 'spider_200406'
    allowed_domains = ['www.stockx.com']
    start_urls = ['https://stockx.com/sneakers/release-date?page=1']
    BASE_URL = 'https://stockx.com/sneakers/release-date'

    def parse(self, response):
        links = response.xpath('//a[@class="TileBody-sc-1d2ws1l-0 bKAXcS"/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_info)

    def parse_info(self, response):
        item = Spider200406Item()
        item["link"] = response.url
        item["name"] = "".join(response.xpath("//h1[@class='name']//text()").extract())
        item["release_date"] = "".join(response.xpath("//span[@data-testid='product-detail-release date']//text()").extract())
        item["retail_price"] = "".join(response.xpath("//span[@data-testid='product-detail-retail price']//text()").extract())
        item["resell_price"] = "".join(response.xpath("//div[@class='gauge-value']//text()").extract())
        return item
What I have tried so far:
- Changed USER_AGENT to a non-default value
- Changed ROBOTSTXT_OBEY to False in settings.py
- Changed DOWNLOAD_DELAY to 7 in settings.py
- Used the Scrapy shell for further info; view(response) showed a blank page
- Checked whether the information to scrape is rendered by JavaScript (it should not be)
- Changed start_urls to start_url, which Scrapy did not accept
- Connected through a VPN
I have also tried the same code structure with a much simpler website. However, I am receiving the same error message, which leads me to the conclusion that something with the code must be wrong.
Entire trace:
2020-04-06 14:33:02 [scrapy.core.engine] INFO: Spider opened
2020-04-06 14:33:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-06 14:33:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-06 14:33:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)
2020-04-06 14:33:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)
Traceback (most recent call last):
File "/Applications/anaconda3/lib/python3.7/site-packages/parsel/selector.py", line 238, in xpath
**kwargs)
File "src/lxml/etree.pyx", line 1581, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid predicate
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/defer.py", line 117, in iter_errback
yield next(it)
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/python.py", line 345, in __next__
return next(self.data)
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "/Users/ritterm/Desktop/Data2Dollar_Coding/Group_project/stockx_200406/stockx_200406/spiders/spider_200406.py", line 20, in parse
links = response.xpath('//a[@class="TileBody-sc-1d2ws1l-0 bKAXcS"/@href').extract()
File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/response/text.py", line 117, in xpath
return self.selector.xpath(query, **kwargs)
File "/Applications/anaconda3/lib/python3.7/site-packages/parsel/selector.py", line 242, in xpath
six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])
File "/Applications/anaconda3/lib/python3.7/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/Applications/anaconda3/lib/python3.7/site-packages/parsel/selector.py", line 238, in xpath
**kwargs)
File "src/lxml/etree.pyx", line 1581, in lxml.etree._Element.xpath
File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
ValueError: XPath error: Invalid predicate in //a[@class="TileBody-sc-1d2ws1l-0 bKAXcS"/@href
2020-04-06 14:33:03 [scrapy.core.engine] INFO: Closing spider (finished)
Any suggestions and ideas are highly appreciated. MR
Solution
There are multiple bugs in your code that prevent Scrapy from being successful.
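First of all, as already pointed out, correct your allowed_domains to allowed_domains = ['stockx.com'] or remove the line completely. The links you follow live on stockx.com, which is not covered by 'www.stockx.com', so Scrapy's offsite filtering would drop those requests.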
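To see why 'www.stockx.com' does not cover stockx.com, here is a minimal sketch of the kind of host check Scrapy's offsite filtering performs. The function host_allowed and the product path are hypothetical, and the regex is a simplified stand-in for Scrapy's own logic, not its actual code:

```python
import re
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    # An allowed domain matches itself and anything ending in ".<domain>".
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


product_url = "https://stockx.com/some-shoe"  # hypothetical product page

print(host_allowed(product_url, ["www.stockx.com"]))  # False
print(host_allowed(product_url, ["stockx.com"]))      # True
print(host_allowed("https://www.stockx.com/some-shoe", ["stockx.com"]))  # True
```

Listing the bare domain is the safer choice because it also covers the www. subdomain, while the reverse is not true.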
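Furthermore, your BASE_URL is wrong. The extracted hrefs are paths relative to the site root, so appending them to the listing URL produces addresses that do not exist. Change it to: BASE_URL = 'https://stockx.com'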
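The difference is easy to check with the standard library. This is only an illustration: the href value is a made-up example, and it assumes the site returns root-relative links:

```python
from urllib.parse import urljoin

page_url = "https://stockx.com/sneakers/release-date?page=1"
href = "/some-shoe-name"  # hypothetical root-relative link from the listing page

# Plain concatenation with the old BASE_URL keeps the listing path.
wrong = "https://stockx.com/sneakers/release-date" + href
# urljoin resolves the href against the page URL, like a browser would.
right = urljoin(page_url, href)

print(wrong)  # https://stockx.com/sneakers/release-date/some-shoe-name
print(right)  # https://stockx.com/some-shoe-name
```

Inside a spider, response.urljoin(link) resolves a link against the response URL in the same way, which would make a hard-coded BASE_URL unnecessary.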
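Moreover, as the stack trace says (XPath error: Invalid predicate), there is an error in your XPath: the predicate is never closed, because the ] is missing before /@href. The corrected expression would be //a[@class="TileBody-sc-1d2ws1l-0 bKAXcS"]/@href. Since that class name looks auto-generated and may change, I resolved it with a simpler CSS selector that gets the link to each shoe page: response.css('.browse-grid a::attr(href)').extract()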
So to sum up, the following code should do exactly what you want:
import scrapy


class Spider200406Item(scrapy.Item):
    link = scrapy.Field()
    name = scrapy.Field()
    release_date = scrapy.Field()
    retail_price = scrapy.Field()
    resell_price = scrapy.Field()


class Spider200406Spider(scrapy.Spider):
    name = 'spider_200406'
    start_urls = ['https://stockx.com/sneakers/release-date?page=1']
    allowed_domains = ['stockx.com']
    BASE_URL = 'https://stockx.com'

    def parse(self, response):
        # Collect the link to every shoe page on the listing page.
        links = response.css('.browse-grid a::attr(href)').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_info)

    def parse_info(self, response):
        # Extract the four data points from a single shoe page.
        item = Spider200406Item()
        item["link"] = response.url
        item["name"] = "".join(response.xpath("//h1[@class='name']//text()").extract())
        item["release_date"] = "".join(response.xpath("//span[@data-testid='product-detail-release date']//text()").extract())
        item["retail_price"] = "".join(response.xpath("//span[@data-testid='product-detail-retail price']//text()").extract())
        item["resell_price"] = "".join(response.xpath("//div[@class='gauge-value']//text()").extract())
        return item
Make sure that you set a browser-like user agent in your settings.py, such as USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
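The question also asks about moving on to the next listing page. One possible approach, sketched here under the assumption that the site keeps paginating through the page query parameter seen in the start URL, is to derive the next URL from the current one. The helper next_page_url is hypothetical and not part of Scrapy:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def next_page_url(url):
    """Return the same URL with its 'page' query parameter increased by one."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    # Treat a missing 'page' parameter as page 1.
    current = int(query.get("page", ["1"])[0])
    query["page"] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


print(next_page_url("https://stockx.com/sneakers/release-date?page=1"))
# https://stockx.com/sneakers/release-date?page=2
```

In parse, after the loop over links, you could then yield scrapy.Request(next_page_url(response.url), callback=self.parse) only when links is non-empty, so the crawl stops at the first page without any shoes. Whether the site actually returns an empty listing past the last page is something you would need to verify.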
Answered By - carpa_jo