Issue
I'm writing a Scrapy spider. A parse callback yields new requests with the callback argument set to NO_CALLBACK, which should tell Scrapy not to call any callback at all. But Scrapy is calling it anyway and raising this error:
RuntimeError: The NO_CALLBACK callback has been called. This is a special callback value intended for requests whose callback is never meant to be called.
Here is the code:
from scrapy import Spider
from scrapy.http import TextResponse
from scrapy.http.request import NO_CALLBACK


class AppsSpider(Spider):
    name = "Apps"
    allowed_domains = ['steampowered.com', 'steamstatic.com']
    start_urls = ['https://store.steampowered.com/app/20']

    def parse(self, response: TextResponse):
        # preview media
        preview_section = response.css('#game_highlights')
        main_image_selector = '.game_header_image_full::attr("src")'
        preview_img_selector = '.highlight_screenshot a::attr("href")'
        preview_videos_selector = '.highlight_movie::attr("data-mp4-hd-source")'
        links = preview_section.css(
            ', '.join([main_image_selector, preview_img_selector, preview_videos_selector])).getall()
        # description section media
        description_section = response.css('#aboutThisGame')
        description_img_gif_selector = 'img::attr("src")'
        links += description_section.css(description_img_gif_selector).getall()
        yield from response.follow_all(links, callback=NO_CALLBACK)
I tried to solve it by removing the errback callback, cb_kwargs, meta, and dont_filter=True. None of it worked.
The documentation says:
When assigned to the callback parameter of Request, it indicates that the request is not meant to have a spider callback at all.
Edit: Edited to be an MRE (minimal reproducible example); you can run it with scrapy runspider nameofscript.py
Here's the traceback:
2023-04-02 21:12:11 [scrapy.core.scraper] ERROR: Spider error processing <GET https://cdn.cloudflare.steamstatic.com/steam/apps/20/0000000164.1920x1080.jpg?t=1579634708> (referer: https://store.steampowered.com/app/20)
Traceback (most recent call last):
  File "C:\Users\leiver\miniconda3\envs\steam-scraping\Lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\leiver\miniconda3\envs\steam-scraping\Lib\site-packages\scrapy\http\request\__init__.py", line 40, in NO_CALLBACK
    raise RuntimeError(
RuntimeError: The NO_CALLBACK callback has been called. This is a special callback value intended for requests whose callback is never meant to be called.
Solution
There is nothing magic about NO_CALLBACK. When you yield a request to the Scrapy engine, it will always attempt to process the response using the default callback, or the one specified via the callback parameter; this is true even when the callback is NO_CALLBACK. What NO_CALLBACK is supposed to do is act as a flag of sorts, so that you can write custom middleware that listens for it and treats those requests differently from standard Scrapy requests.
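The mechanism is just an identity check: a component compares request.callback is NO_CALLBACK and skips the call, while the engine's normal yielded-request path does no such check and invokes whatever callback is set. That pattern can be sketched without Scrapy at all; the Request class and process_response function below are hypothetical stand-ins for illustration, not Scrapy's real API:

```python
from dataclasses import dataclass
from typing import Callable, Optional


def NO_CALLBACK(*args, **kwargs):
    # Sentinel: exists only to be compared by identity, never called.
    raise RuntimeError("The NO_CALLBACK callback has been called.")


@dataclass
class Request:
    # Hypothetical stand-in for scrapy.Request.
    url: str
    callback: Optional[Callable] = None


def process_response(request: Request, response: str) -> str:
    # Hypothetical stand-in for a middleware/component response path:
    # a component that knows about the sentinel checks identity and
    # skips the call instead of invoking it.
    if request.callback is NO_CALLBACK:
        return "skipped"          # treat the flagged request specially
    if request.callback is not None:
        return request.callback(response)
    return "default parse"        # fall back to the spider's parse()


flagged = Request("https://example.com", callback=NO_CALLBACK)
plain = Request("https://example.com", callback=lambda r: "handled")
print(process_response(flagged, "<html>"))  # skipped
print(process_response(plain, "<html>"))    # handled
```

The point of the sketch: the engine's yielded-request path behaves like the bottom two branches only, with no identity check, which is why yielding a request with callback=NO_CALLBACK from parse ends up calling the sentinel and raising RuntimeError.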
If we look at the source code for scrapy.http.request.NO_CALLBACK, you see:
def NO_CALLBACK(*args, **kwargs):
    """When assigned to the ``callback`` parameter of
    :class:`~scrapy.http.Request`, it indicates that the request is not meant
    to have a spider callback at all.

    For example:

    .. code-block:: python

        Request("https://example.com", callback=NO_CALLBACK)

    This value should be used by :ref:`components <topics-components>` that
    create and handle their own requests, e.g. through
    :meth:`scrapy.core.engine.ExecutionEngine.download`, so that downloader
    middlewares handling such requests can treat them differently from requests
    intended for the :meth:`~scrapy.Spider.parse` callback.
    """
    raise RuntimeError(
        "The NO_CALLBACK callback has been called. This is a special callback "
        "value intended for requests whose callback is never meant to be "
        "called."
    )
And if you search the documentation, the only example that uses this value never actually yields the request; instead, it calls the Scrapy engine directly:
request = scrapy.Request(screenshot_url, callback=NO_CALLBACK)
response = await maybe_deferred_to_future(
    spider.crawler.engine.download(request, spider)
)
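In other words, engine.download hands the response straight back to the awaiting caller and never routes it through a callback, whereas a yielded request makes the engine invoke whatever callback is set. That difference can be modeled with a toy engine (a simplification for illustration, not Scrapy's actual implementation):

```python
def NO_CALLBACK(*args, **kwargs):
    # Same behaviour as scrapy.http.request.NO_CALLBACK: calling it is an error.
    raise RuntimeError("The NO_CALLBACK callback has been called.")


class ToyEngine:
    """Toy model of the two ways a request's response can be handled."""

    def fetch(self, url: str) -> str:
        # Stand-in for performing the actual download.
        return f"<response for {url}>"

    def download(self, request: dict) -> str:
        # engine.download style: the response is returned to the caller
        # and the callback is never invoked, so NO_CALLBACK is safe here.
        return self.fetch(request["url"])

    def crawl(self, request: dict) -> str:
        # Yielded-request style: the engine always invokes the callback,
        # so NO_CALLBACK raises RuntimeError here.
        response = self.fetch(request["url"])
        return request["callback"](response)


engine = ToyEngine()
request = {"url": "https://example.com", "callback": NO_CALLBACK}
print(engine.download(request))  # fine: the callback is never called
# engine.crawl(request)          # would raise RuntimeError
```

Applied to the spider above: instead of yielding the media requests, you could await them through spider.crawler.engine.download inside the callback, as in the documentation example, so that no callback is ever invoked for them.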
Answered By - Alexander