Issue
I'm quite new to web scraping in Python and am currently trying to crawl Amazon's latest books. As in many tutorials, I use the random User-Agent middleware from scrapy-fake-useragent.
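For reference, the middleware is wired up roughly like this in my settings.py (a sketch following the scrapy-fake-useragent README; my full settings.py is in the pastebin links below):

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in middlewares so the random-UA versions take over
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',
]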
At first I managed to crawl the page. However, in the past few days, the spider only returns "Spider error processing". Perhaps Amazon is blocking the user agent, or perhaps something is missing in my code that I cannot find.
Here's what the terminal returns:
2020-10-22 01:37:59 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapyamazon)
2020-10-22 01:37:59 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.3 (default, Jul 2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-22 01:37:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-22 01:37:59 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapyamazon',
'NEWSPIDER_MODULE': 'scrapyamazon.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapyamazon.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-22 01:37:59 [scrapy.extensions.telnet] INFO: Telnet Password: cd809e0ec7c2ec6a
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-22 01:37:59 [faker.factory] DEBUG: Not in REPL -> leaving logger event level as is.
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] DEBUG: Loaded User-Agent provider: scrapy_fake_useragent.providers.FakeUserAgentProvider
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] INFO: Using '<class 'scrapy_fake_useragent.providers.FakeUserAgentProvider'>' as the User-Agent provider
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] DEBUG: Loaded User-Agent provider: scrapy_fake_useragent.providers.FakeUserAgentProvider
2020-10-22 01:37:59 [scrapy_fake_useragent.middleware] INFO: Using '<class 'scrapy_fake_useragent.providers.FakeUserAgentProvider'>' as the User-Agent provider
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware',
'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-22 01:37:59 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapyamazon.pipelines.ScrapyamazonPipeline']
2020-10-22 01:37:59 [scrapy.core.engine] INFO: Spider opened
2020-10-22 01:37:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-22 01:37:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-22 01:38:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2020-10-22 01:38:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Books-Last-30-days/s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011> (referer: None)
2020-10-22 01:38:01 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Books-Last-30-days/s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011> (referer: None)
Traceback (most recent call last):
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
return next(self.data)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 347, in __next__
return next(self.data)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\core\spidermw.py", line 64, in _evaluate_iterable
for r in iterable:
File "C:\Users\freud\Documents\Demos\vstoolbox\scrapyamazon\scrapyamazon\spiders\amazon_spider.py", line 22, in parse
price_kindle = response.css("div[./a[contains(text(),'Kindle')]]")
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\scrapy\http\response\text.py", line 142, in css
return self.selector.css(query)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\parsel\selector.py", line 264, in css
return self.xpath(self._css2xpath(query))
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\parsel\selector.py", line 267, in _css2xpath
return self._csstranslator.css_to_xpath(query)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\parsel\csstranslator.py", line 109, in css_to_xpath
return super(HTMLTranslator, self).css_to_xpath(css, prefix)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\xpath.py", line 192, in css_to_xpath
for selector in parse(css))
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 415, in parse
return list(parse_selector_group(stream))
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 428, in parse_selector_group
yield Selector(*parse_selector(stream))
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 436, in parse_selector
result, pseudo_element = parse_simple_selector(stream)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 498, in parse_simple_selector
result = parse_attrib(result, stream)
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 569, in parse_attrib
attrib = stream.next_ident_or_star()
File "C:\Users\freud\anaconda3\envs\condatest\lib\site-packages\cssselect\parser.py", line 829, in next_ident_or_star
raise SelectorSyntaxError(
File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected ident or '*', got <DELIM '.' at 4>
2020-10-22 01:38:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-22 01:38:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 678,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 4242,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 1.312107,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 18, 38, 1, 219171),
'log_count/DEBUG': 5,
'log_count/ERROR': 1,
'log_count/INFO': 12,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/SelectorSyntaxError': 1,
'start_time': datetime.datetime(2020, 10, 21, 18, 37, 59, 907064)}
2020-10-22 01:38:01 [scrapy.core.engine] INFO: Spider closed (finished)
Sorry to dump everything here, but I'm not sure where the error stems from. My conclusion is that it has something to do with the user agent being blocked or not recognized, based on this line:
2020-10-22 01:38:01 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Books-Last-30-days/s?rh=n%3A283155%2Cp_n_publication_date%3A1250226011> (referer: None)
For your reference, here's my full code:
- amazon_spider.py: https://pastebin.com/tBQqa2jQ
- items.py: https://pastebin.com/rkUTqWSz
- pipelines.py: https://pastebin.com/2fkYf57f
- settings.py: https://pastebin.com/ZtLXqsyW
Thank you in advance for your help!
Solution
The problem is that you are calling the CSS selector method but passing it an XPath expression:
price_kindle = response.css("div[./a[contains(text(),'Kindle')]]")
Change it to .xpath(). Note the added leading // so the query searches the whole document rather than only direct children of the root:
price_kindle = response.xpath("//div[./a[contains(text(),'Kindle')]]")
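For background: cssselect, which parsel uses to translate CSS queries into XPath, cannot express a "contains this text" condition at all (CSS has no :contains() selector), so a text-based filter like this one has to be written as XPath. Purely structural queries, by contrast, work with either method, for example (a sketch, not from the original spider):

# text predicate: only possible with XPath
price_divs = response.xpath("//div[./a[contains(text(), 'Kindle')]]")

# plain structural selection: CSS and XPath are interchangeable
links = response.css("div > a")
links = response.xpath("//div/a")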
By the way, this is unrelated to the problem, but you are assigning to the same variable twice, so the first value gets overwritten by the second. Here:
price_kindle = response.css("div[./a[contains(text(),'Kindle')]]")
price_kindle = price_hardcover.xpath("./following-sibling::div//span[contains(@class,'a-offscreen')]/text()").extract()
However, as I mentioned, this is not related to the question. Just a heads-up.
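If the intent was to keep both values, renaming the variables avoids the overwrite, along these lines (a hypothetical restructuring; price_hardcover comes from elsewhere in the full spider, so adjust the base of the second query as needed):

# select the div that contains the Kindle link
kindle_div = response.xpath("//div[./a[contains(text(), 'Kindle')]]")

# extract the price text from the sibling div, e.g. ['$9.99']
price_kindle = kindle_div.xpath(
    "./following-sibling::div//span[contains(@class, 'a-offscreen')]/text()"
).extract()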
Answered By - renatodvc