Issue
In the Scrapy docs, this code is used to illustrate how to pass information to a callback function. My question is, how does the CrawlSpider class that this code is within know to execute the yielded request object? Is that simply coded behavior? Additionally, is yield used instead of return to keep the function running and ready to accept more Response objects in case multiple urls are being scraped? Would returning a Request object work just as well if only one url was being scraped? I apologize if these are basic questions; I haven't used Python or Scrapy before.
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
Solution
The Scrapy framework intentionally abstracts away a lot of complexity to make writing crawlers seem trivial, so many of its features can look like "magic". In your example the CrawlSpider knows to execute the yielded request because the method that called your custom parse method is programmed to expect as much. You may also have noticed that you never actually call any of the parse method(s) directly; they are called by Scrapy's internal engine, so when you yield results from a parse method, those results are passed back to the caller and processed internally. Each yielded object goes through various checks: if it is an Item-like object, it is handed off to the middleware and item pipelines; if it is another Request object, it is added to the internal Scheduler queue of requests that have yet to be processed. This might seem unusual, especially if you are only used to libraries where you, as the developer, are responsible for piecing everything together, but it isn't uncommon in larger frameworks.
The yield statement enables a more flexible means of iteration: a function that contains yield returns a generator object. Instead of having to evaluate an iterable all at once, a generator produces one result at a time and hands it to whoever is iterating; execution then resumes where it left off when the next item is requested. In many cases yielding a single item does the same thing as returning it. There are some exceptions though, for example if there is any remaining clean-up code that takes place after the yield statement. For this reason, when working with frameworks and APIs like Scrapy, I suggest you stick with whatever the documentation recommends.
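As a quick illustration of that difference, the snippet below uses plain Python (no Scrapy) to show that a generator produces values lazily and that code after a yield still runs when iteration continues; the function name produce_values is just for the example.

def produce_values():
    for n in range(3):
        yield n                              # hand back one value, then pause here
    print("clean-up after the last yield")   # still runs when the generator is exhausted

gen = produce_values()    # nothing in the body has run yet
print(next(gen))          # 0 -- runs the body up to the first yield
print(list(gen))          # [1, 2] -- drains the rest and triggers the clean-up line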
Answered By - alexpdev