Tuesday, November 9, 2021

[FIXED] Scrapy cannot reach start_urls: DEBUG: Crawled (200) and ERROR


Issue

I am trying to scrape information from a sneaker website with Scrapy for a university project. The idea is to have Scrapy follow the link for each shoe and scrape four data points (name, release_date, retail_price, resell_price), then go back to the listing page, follow the next link, and scrape again. At the end of the page, it should move on to the next page and repeat until there are no more links.

However, I always run into a DEBUG and an ERROR message when Scrapy tries to reach the given start URL.

2020-04-06 11:59:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)
2020-04-06 11:59:56 [scrapy.core.scraper] ERROR: Spider error processing <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)

Here's the code:

import scrapy

class Spider200406Item(scrapy.Item):
    link = scrapy.Field()
    name = scrapy.Field()
    release_date = scrapy.Field()
    retail_price = scrapy.Field()
    resell_price = scrapy.Field()


class Spider200406Spider(scrapy.Spider):
    name = 'spider_200406'
    allowed_domains = ['www.stockx.com']
    start_urls = ['https://stockx.com/sneakers/release-date?page=1']

    BASE_URL = 'https://stockx.com/sneakers/release-date'

    def parse(self, response):
        links = response.xpath('//a[@class="TileBody-sc-1d2ws1l-0 bKAXcS"/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_info)

    def parse_info(self, response):
        item = Spider200406Item()
        item["link"] = response.url
        item["name"] = "".join(response.xpath("//h1[@class='name']//text()").extract())
        item["release_date"] = "".join(response.xpath("//span[@data-testid='product-detail-release date']//text()").extract())
        item["retail_price"] = "".join(response.xpath("//span[@data-testid='product-detail-retail price']//text()").extract())
        item["resell_price"] = "".join(response.xpath("//div[@class='gauge-value']//text()").extract())
        return item

What I have tried so far:
- Changed User-Agent to a non-default
- Changed ROBOTSTXT_OBEY to False in settings.py
- Changed DOWNLOAD_DELAY to 7 in settings.py
- Used the Scrapy shell for further information, which yielded a blank link when calling view(response)
- Checked whether the information to scrape is rendered by JavaScript (it should not be)
- Changed start_urls to start_url, which Scrapy did not accept
- Went via a VPN connection

I have also tried the same code structure with a much simpler website. However, I receive the same error message, which leads me to conclude that something must be wrong with the code.

Entire trace:

2020-04-06 14:33:02 [scrapy.core.engine] INFO: Spider opened
2020-04-06 14:33:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-06 14:33:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-06 14:33:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)
2020-04-06 14:33:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://stockx.com/sneakers/release-date?page=1> (referer: None)
Traceback (most recent call last):
  File "/Applications/anaconda3/lib/python3.7/site-packages/parsel/selector.py", line 238, in xpath
    **kwargs)
  File "src/lxml/etree.pyx", line 1581, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid predicate

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/defer.py", line 117, in iter_errback
    yield next(it)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/python.py", line 345, in __next__
    return next(self.data)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/utils/python.py", line 345, in __next__
    return next(self.data)
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 338, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Users/ritterm/Desktop/Data2Dollar_Coding/Group_project/stockx_200406/stockx_200406/spiders/spider_200406.py", line 20, in parse
    links = response.xpath('//a[@class="TileBody-sc-1d2ws1l-0 bKAXcS"/@href').extract()
  File "/Applications/anaconda3/lib/python3.7/site-packages/scrapy/http/response/text.py", line 117, in xpath
    return self.selector.xpath(query, **kwargs)
  File "/Applications/anaconda3/lib/python3.7/site-packages/parsel/selector.py", line 242, in xpath
    six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])
  File "/Applications/anaconda3/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/Applications/anaconda3/lib/python3.7/site-packages/parsel/selector.py", line 238, in xpath
    **kwargs)
  File "src/lxml/etree.pyx", line 1581, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
ValueError: XPath error: Invalid predicate in //a[@class="TileBody-sc-1d2ws1l-0 bKAXcS"/@href
2020-04-06 14:33:03 [scrapy.core.engine] INFO: Closing spider (finished)

Any suggestions and ideas are highly appreciated. MR

Solution

There are multiple bugs in your code that prevent Scrapy from succeeding.

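First of all, correct your allowed_domains to allowed_domains = ['stockx.com'] or remove the line completely. With ['www.stockx.com'], follow-up requests to stockx.com count as off-site and are filtered out, because allowed_domains matches only the listed domain and its subdomains.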
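
To see why the www prefix matters, here is a minimal sketch of that kind of domain check. It is a simplified illustration, not Scrapy's actual implementation, and the helper name is_allowed is made up for this example; it uses only the standard library.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration only; it mimics the matching rule,
    not Scrapy's own code.
    """
    host = urlparse(url).hostname or ""
    # Optional "<anything>." prefix, followed by one of the allowed domains.
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


url = "https://stockx.com/sneakers/release-date?page=1"
print(is_allowed(url, ["www.stockx.com"]))  # False: stockx.com is not under www.stockx.com
print(is_allowed(url, ["stockx.com"]))      # True: exact match
print(is_allowed("https://www.stockx.com/", ["stockx.com"]))  # True: subdomain
```

Listing the bare domain therefore covers both stockx.com and www.stockx.com, while listing the www host covers only that host.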

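Furthermore, your BASE_URL is wrong. Change it to: BASE_URL = 'https://stockx.com'. The extracted hrefs are appended to BASE_URL in parse, so it has to be the site root, not the path of the listing page.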
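
As a rough illustration of the difference, the sketch below compares plain string concatenation with urljoin from the standard library. The href value is a made-up placeholder, and it assumes the site's links are root-relative (starting with "/"), which the fix above implies but which was not verified here.

```python
from urllib.parse import urljoin

page_url = "https://stockx.com/sneakers/release-date?page=1"
href = "/some-sneaker"  # hypothetical root-relative link, for illustration only

# Original approach: the listing path is wrongly kept in front of the link.
concatenated = "https://stockx.com/sneakers/release-date" + href

# urljoin resolves the href against the page URL, as a browser would.
joined = urljoin(page_url, href)

print(concatenated)  # https://stockx.com/sneakers/release-date/some-sneaker
print(joined)        # https://stockx.com/some-sneaker
```

Inside a spider, response.urljoin(link) performs the same resolution against the response URL, so it is an alternative to keeping a BASE_URL constant at all.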

Moreover, as the stack trace shows, there is an error in your XPath: the predicate is never closed, because the ] is missing before /@href. I resolved it by using a simple CSS selector to get the link to each shoe page: response.css('.browse-grid a::attr(href)').extract()

To sum up, the following code should do what you want:

import scrapy

class Spider200406Item(scrapy.Item):
    link = scrapy.Field()
    name = scrapy.Field()
    release_date = scrapy.Field()
    retail_price = scrapy.Field()
    resell_price = scrapy.Field()


class Spider200406Spider(scrapy.Spider):
    name = 'spider_200406'
    start_urls = ['https://stockx.com/sneakers/release-date?page=1']
    allowed_domains = ['stockx.com']
    BASE_URL = 'https://stockx.com'

    def parse(self, response):
        links = response.css('.browse-grid a::attr(href)').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_info)

    def parse_info(self, response):
        item = Spider200406Item()
        item["link"] = response.url
        item["name"] = "".join(response.xpath("//h1[@class='name']//text()").extract())
        item["release_date"] = "".join(response.xpath("//span[@data-testid='product-detail-release date']//text()").extract())
        item["retail_price"] = "".join(response.xpath("//span[@data-testid='product-detail-retail price']//text()").extract())
        item["resell_price"] = "".join(response.xpath("//div[@class='gauge-value']//text()").extract())
        return item

Make sure that you set a browser-like user agent in your settings, such as USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'.

Answered By - carpa_jo

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
