Issue
I am using Scrapy, Python's web-scraping library, to scrape some data. Running 'scrapy shell url' works fine for other pages, but for some reason it does not work for the URL below.
scrapy shell should drop you into a '>>>' prompt, from which you can play around with the response object. When I run scrapy shell with the exact same spider but with the URL below, it never reaches that prompt. Here is the output:
scrapy shell https://feedback.aliexpress.com/display/productEvaluation.htm?&productId=4000033626141&ownerMemberId=22
[12] 40068
[13] 40069
[10] Exit 1 scrapy shell https://feedback.aliexpress.com/display/productEvaluation.htm?
[11] Done productId=4000033626141
(aliexpress_scraping) (base) othos-Air:aliexpress_scraping otho$ 2020-12-01 22:08:21 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: googleweblight)
2020-12-01 22:08:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.0 (default, Nov 1 2020, 22:24:33) - [GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)], pyOpenSSL 20.0.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.2.1, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-12-01 22:08:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-12-01 22:08:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'googleweblight',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'aliexpress_scraper.spiders',
'SPIDER_MODULES': ['aliexpress_scraper.spiders']}
2020-12-01 22:08:21 [scrapy.extensions.telnet] INFO: Telnet Password: 642763dfc6c8dd26
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-12-01 22:08:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-12-01 22:08:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6030
2020-12-01 22:08:22 [scrapy.core.engine] INFO: Spider opened
2020-12-01 22:08:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://feedback.aliexpress.com/display/productEvaluation.htm> (referer: None)
Solution
That URL contains characters (e.g. &) that are special characters in some system shells (e.g. Bash).
If you are using a Linux shell, things will probably work if you put the URL between single quotes:
$ scrapy shell 'https://feedback.aliexpress.com/display/productEvaluation.htm?&productId=4000033626141&ownerMemberId=22'
Alternatively, once inside the Scrapy shell, you can use fetch(<URL as a Python string>) to load a different URL.
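As a minimal sketch of why the quoting matters: Python's standard-library shlex.quote escapes a string so the shell treats it as a single literal argument. Applying it to the URL from the question shows the single-quoted form that should be passed to scrapy shell (the URL here is the one from the question; the variable names are just for illustration):

```python
import shlex

url = ("https://feedback.aliexpress.com/display/productEvaluation.htm"
       "?&productId=4000033626141&ownerMemberId=22")

# shlex.quote wraps the URL in single quotes, so the shell passes the
# '&' characters through literally instead of treating them as the
# job-control operator (which is what split the original command into
# background jobs in the output above).
command = "scrapy shell " + shlex.quote(url)
print(command)
```

Without the quoting, Bash runs everything up to the first & in the background and tries to interpret productId=... and ownerMemberId=... as separate commands, which is exactly the [12]/[13] job output shown in the question.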
Answered By - Gallaecio