Issue
In the Scrapy docs, this code is used to illustrate how to pass information to a callback function. My question is, how does the CrawlSpider class that this code is within know to execute the yielded request object? Is that simply coded behavior? Additionally, is yield used instead of return to keep the function running and ready to accept more Response objects in case multiple urls are being scraped? Would returning a Request object work just as well if only one url was being scraped? I apologize if these are basic questions; I haven't used Python or Scrapy before.
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
Solution
The Scrapy framework intentionally abstracts away a lot of complexity to make writing crawlers seem trivial, so many of its features can look like "magic". In your example the CrawlSpider knows to execute the yielded request because the method that called your custom parse method is programmed to expect as much. You may also have noticed that you never actually call any of the parse method(s) directly; they are called by Scrapy's internal engine, so when you yield results from a parse method, those results are passed back to the caller and processed internally. Each yielded object goes through various checks: if it is an Item-like object, it is handed off to the middleware and item pipelines; if it is another Request object, it is added to the internal Scheduler queue of requests that have yet to be processed. This might seem unusual, especially if you are only used to libraries where you, as the developer, are responsible for piecing everything together, but it isn't uncommon in larger frameworks.
The yield statement enables a more flexible means of iteration: a function that contains yield returns a generator object. Instead of having to evaluate an iterable all at once, a generator produces one result at a time and hands it to whoever is iterating; execution then resumes where it left off when the next item is requested. In many cases yielding a single item does the same thing as returning it. There are some exceptions though, for example if there is any remaining clean-up code that takes place after the yield statement. For this reason, when working with frameworks and APIs like Scrapy, I suggest you stick with whatever the documentation recommends.
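As a quick illustration of that difference, the snippet below uses plain Python (no Scrapy) to show that a generator produces values lazily and that code after a yield still runs when iteration continues; the function name produce_values is just for the example.

def produce_values():
    for n in range(3):
        yield n                              # hand back one value, then pause here
    print("clean-up after the last yield")   # still runs when the generator is exhausted

gen = produce_values()    # nothing in the body has run yet
print(next(gen))          # 0 -- runs the body up to the first yield
print(list(gen))          # [1, 2] -- drains the rest and triggers the clean-up line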
Answered By - alexpdev