Issue
I am learning the Scrapy framework and tried to use the Scrapy shell to fetch a response from "https://quotes.toscrape.com/". The commands I ran are below:
python -m scrapy shell
Inside the shell:
>>> from scrapy import Request
>>> req = Request("https://quotes.toscrape.com/")
>>> fetch(req)
Then I got this error:
PS D:\Projects\scrapyLearn\introSpider\introSpider> python -m scrapy shell
2022-11-30 15:04:52 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: introSpider)
2022-11-30 15:04:52 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.10, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.22000-SP0
2022-11-30 15:04:52 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'introSpider',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'introSpider.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['introSpider.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-30 15:04:52 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-30 15:04:52 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-30 15:04:52 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-30 15:04:52 [scrapy.extensions.telnet] INFO: Telnet Password: 9ec5c326bbb22c54
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-30 15:04:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000002601B1B48D0>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x000002601B3EC550>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> from scrapy import Request
>>> req = Request("https://quotes.toscrape.com/")
>>> fetch(req)
2022-11-30 15:05:46 [scrapy.core.engine] INFO: Spider opened
2022-11-30 15:05:47 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2022-11-30 15:05:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/> (referer: None)
>>> 2022-11-30 15:05:47 [scrapy.core.scraper] ERROR: Spider error processing <GET https://quotes.toscrape.com/> (referer: None)
Traceback (most recent call last):
File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
current.result = callback( # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\utils\defer.py", line 285, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\utils\defer.py", line 272, in deferred_from_coro
event_loop = get_asyncio_event_loop_policy().get_event_loop()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\asyncio\events.py", line 677, in get_event_loop
raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Thread-1 (start)'.
2022-11-30 15:05:47 [py.warnings] WARNING: C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py:892: RuntimeWarning: coroutine 'SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output' was never awaited
current.result = callback( # type: ignore[misc]
The shell keeps running after this. I don't know what the error is or how to fix it.
I was just trying to get the response from the "https://quotes.toscrape.com/" website.
Solution
I recreated the same steps and had no problem getting the page. I would recommend changing this setting in settings.py:
ROBOTSTXT_OBEY = False
because, as you can see in the logs, Scrapy receives a 404 (Not Found) when it makes its first request to https://quotes.toscrape.com/robots.txt, which does not exist.
I would also recommend calling fetch directly with the URL as its argument, for example: fetch("https://quotes.toscrape.com/")
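As a side note on the traceback in the question: the RuntimeError message itself comes from asyncio, which raises it when code asks for the current event loop in a thread that has none set. The sketch below is only an illustration of that message using the standard library. It does not use Scrapy, it is not a claim about how Scrapy triggers the error, and the function names and thread names are made up for the example.

```python
import asyncio
import threading
import warnings


def loop_lookup_in_thread():
    """Ask for the event loop in a fresh thread; return the error text, if any."""
    result = {"error": None}

    def worker():
        # Some Python versions also emit a DeprecationWarning here; ignore it.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", DeprecationWarning)
            try:
                asyncio.get_event_loop()
            except RuntimeError as exc:
                result["error"] = str(exc)

    thread = threading.Thread(target=worker, name="Thread-1 (demo)")
    thread.start()
    thread.join()
    return result["error"]


def loop_lookup_with_loop_set():
    """Same lookup, but the thread first creates and registers its own loop."""
    result = {"same_loop": False}

    def worker():
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", DeprecationWarning)
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
            try:
                result["same_loop"] = asyncio.get_event_loop() is loop
            finally:
                asyncio.set_event_loop(None)
                loop.close()

    thread = threading.Thread(target=worker, name="Thread-2 (demo)")
    thread.start()
    thread.join()
    return result["same_loop"]


if __name__ == "__main__":
    print(loop_lookup_in_thread())
    print(loop_lookup_with_loop_set())
```

The first function shows the lookup failing in a thread with no loop. The second shows the same lookup succeeding once the thread has registered a loop of its own.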
Answered By - Victor Lozoya