Issue
I'm writing a Scrapy spider. A parse callback yields new requests with the callback argument set to NO_CALLBACK, which should tell Scrapy not to call any callback at all. But Scrapy is calling it anyway and raising this error:
RuntimeError: The NO_CALLBACK callback has been called. This is a special callback value intended for requests whose callback is never meant to be called.
Here is the code:
from scrapy import Spider
from scrapy.http import TextResponse
from scrapy.http.request import NO_CALLBACK


class AppsSpider(Spider):
    name = "Apps"
    allowed_domains = ['steampowered.com', 'steamstatic.com']
    start_urls = ['https://store.steampowered.com/app/20']

    def parse(self, response: TextResponse):
        # preview media
        preview_section = response.css('#game_highlights')
        main_image_selector = '.game_header_image_full::attr("src")'
        preview_img_selector = '.highlight_screenshot a::attr("href")'
        preview_videos_selector = '.highlight_movie::attr("data-mp4-hd-source")'
        links = preview_section.css(
            ', '.join([main_image_selector, preview_img_selector, preview_videos_selector])).getall()
        # description section media
        description_section = response.css('#aboutThisGame')
        description_img_gif_selector = 'img::attr("src")'
        links += description_section.css(description_img_gif_selector).getall()
        yield from response.follow_all(links, callback=NO_CALLBACK)
I tried to solve it by removing the errback callback, cb_kwargs, meta, and dont_filter=True. None of it worked.
The documentation says:
When assigned to the callback parameter of Request, it indicates that the request is not meant to have a spider callback at all.
Edit: Edited to be an MRE (minimal reproducible example); you can run it with scrapy runspider nameofscript.py
Here's the traceback:
2023-04-02 21:12:11 [scrapy.core.scraper] ERROR: Spider error processing <GET https://cdn.cloudflare.steamstatic.com/steam/apps/20/0000000164.1920x1080.jpg?t=1579634708> (referer: https://store.steampowered.com/app/20)
Traceback (most recent call last):
  File "C:\Users\leiver\miniconda3\envs\steam-scraping\Lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\leiver\miniconda3\envs\steam-scraping\Lib\site-packages\scrapy\http\request\__init__.py", line 40, in NO_CALLBACK
    raise RuntimeError(
RuntimeError: The NO_CALLBACK callback has been called. This is a special callback value intended for requests whose callback is never meant to be called.
Solution
There is nothing magic about NO_CALLBACK. When you yield a request to the Scrapy engine, it will always attempt to process the response using the default callback, or the one specified via the callback parameter; this is true even when the callback is NO_CALLBACK. What NO_CALLBACK is supposed to do is act as a flag of sorts, so that you can write custom middleware that listens for it and treats those requests differently from standard Scrapy requests.
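The mechanism is just an identity check: a component compares request.callback is NO_CALLBACK and skips the call, while the engine's normal yielded-request path does no such check and invokes whatever callback is set. That pattern can be sketched without Scrapy at all; the Request class and process_response function below are hypothetical stand-ins for illustration, not Scrapy's real API:

```python
from dataclasses import dataclass
from typing import Callable, Optional


def NO_CALLBACK(*args, **kwargs):
    # Sentinel: exists only to be compared by identity, never called.
    raise RuntimeError("The NO_CALLBACK callback has been called.")


@dataclass
class Request:
    # Hypothetical stand-in for scrapy.Request.
    url: str
    callback: Optional[Callable] = None


def process_response(request: Request, response: str) -> str:
    # Hypothetical stand-in for a middleware/component response path:
    # a component that knows about the sentinel checks identity and
    # skips the call instead of invoking it.
    if request.callback is NO_CALLBACK:
        return "skipped"          # treat the flagged request specially
    if request.callback is not None:
        return request.callback(response)
    return "default parse"        # fall back to the spider's parse()


flagged = Request("https://example.com", callback=NO_CALLBACK)
plain = Request("https://example.com", callback=lambda r: "handled")
print(process_response(flagged, "<html>"))  # skipped
print(process_response(plain, "<html>"))    # handled
```

The point of the sketch: the engine's yielded-request path behaves like the bottom two branches only, with no identity check, which is why yielding a request with callback=NO_CALLBACK from parse ends up calling the sentinel and raising RuntimeError.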
If we look at the source code for scrapy.http.request.NO_CALLBACK, you see:
def NO_CALLBACK(*args, **kwargs):
    """When assigned to the ``callback`` parameter of
    :class:`~scrapy.http.Request`, it indicates that the request is not meant
    to have a spider callback at all.

    For example:

    .. code-block:: python

        Request("https://example.com", callback=NO_CALLBACK)

    This value should be used by :ref:`components <topics-components>` that
    create and handle their own requests, e.g. through
    :meth:`scrapy.core.engine.ExecutionEngine.download`, so that downloader
    middlewares handling such requests can treat them differently from requests
    intended for the :meth:`~scrapy.Spider.parse` callback.
    """
    raise RuntimeError(
        "The NO_CALLBACK callback has been called. This is a special callback "
        "value intended for requests whose callback is never meant to be "
        "called."
    )
And if you search the documentation, the only example that uses this value never actually yields the request; instead, it calls the Scrapy engine directly:
request = scrapy.Request(screenshot_url, callback=NO_CALLBACK)
response = await maybe_deferred_to_future(
    spider.crawler.engine.download(request, spider)
)
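In other words, engine.download hands the response straight back to the awaiting caller and never routes it through a callback, whereas a yielded request makes the engine invoke whatever callback is set. That difference can be modeled with a toy engine (a simplification for illustration, not Scrapy's actual implementation):

```python
def NO_CALLBACK(*args, **kwargs):
    # Same behaviour as scrapy.http.request.NO_CALLBACK: calling it is an error.
    raise RuntimeError("The NO_CALLBACK callback has been called.")


class ToyEngine:
    """Toy model of the two ways a request's response can be handled."""

    def fetch(self, url: str) -> str:
        # Stand-in for performing the actual download.
        return f"<response for {url}>"

    def download(self, request: dict) -> str:
        # engine.download style: the response is returned to the caller
        # and the callback is never invoked, so NO_CALLBACK is safe here.
        return self.fetch(request["url"])

    def crawl(self, request: dict) -> str:
        # Yielded-request style: the engine always invokes the callback,
        # so NO_CALLBACK raises RuntimeError here.
        response = self.fetch(request["url"])
        return request["callback"](response)


engine = ToyEngine()
request = {"url": "https://example.com", "callback": NO_CALLBACK}
print(engine.download(request))  # fine: the callback is never called
# engine.crawl(request)          # would raise RuntimeError
```

Applied to the spider above: instead of yielding the media requests, you could await them through spider.crawler.engine.download inside the callback, as in the documentation example, so that no callback is ever invoked for them.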
Answered By - Alexander