Issue
I am using Scrapy, Python's web-scraping library, to scrape some data. Running 'scrapy shell url' works fine for other pages, but for some reason it does not work for the URL below.
scrapy shell should drop you into a '>>>' prompt, from which you can play around with the response object. When I run scrapy shell with the exact same spider but with the URL below, it never reaches that prompt. Here is the output:
scrapy shell https://feedback.aliexpress.com/display/productEvaluation.htm?&productId=4000033626141&ownerMemberId=22
[12] 40068
[13] 40069
[10] Exit 1 scrapy shell https://feedback.aliexpress.com/display/productEvaluation.htm?
[11] Done productId=4000033626141
(aliexpress_scraping) (base) othos-Air:aliexpress_scraping otho$ 2020-12-01 22:08:21 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: googleweblight)
2020-12-01 22:08:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.0 (default, Nov 1 2020, 22:24:33) - [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)], pyOpenSSL 20.0.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.2.1, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-12-01 22:08:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-12-01 22:08:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'googleweblight',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'aliexpress_scraper.spiders',
'SPIDER_MODULES': ['aliexpress_scraper.spiders']}
2020-12-01 22:08:21 [scrapy.extensions.telnet] INFO: Telnet Password: 642763dfc6c8dd26
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-12-01 22:08:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6030
2020-12-01 22:08:22 [scrapy.core.engine] INFO: Spider opened
2020-12-01 22:08:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://feedback.aliexpress.com/display/productEvaluation.htm> (referer: None)
Solution
That URL contains characters (e.g. &) that are special characters in some system shells (e.g. Bash).
If you are using a Linux shell, things will probably work if you put the URL between single quotes:
$ scrapy shell 'https://feedback.aliexpress.com/display/productEvaluation.htm?&productId=4000033626141&ownerMemberId=22'
Alternatively, once inside the Scrapy shell, you can use fetch(<URL as a Python string>) to load a different URL.
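As a minimal sketch of why the quoting matters: Python's standard-library shlex.quote escapes a string so the shell treats it as a single literal argument. Applying it to the URL from the question shows the single-quoted form that should be passed to scrapy shell (the URL here is the one from the question; the variable names are just for illustration):

```python
import shlex

url = ("https://feedback.aliexpress.com/display/productEvaluation.htm"
       "?&productId=4000033626141&ownerMemberId=22")

# shlex.quote wraps the URL in single quotes, so the shell passes the
# '&' characters through literally instead of treating them as the
# job-control operator (which is what split the original command into
# background jobs in the output above).
command = "scrapy shell " + shlex.quote(url)
print(command)
```

Without the quoting, Bash runs everything up to the first & in the background and tries to interpret productId=... and ownerMemberId=... as separate commands, which is exactly the [12]/[13] job output shown in the question.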
Answered By - Gallaecio