Issue
I am writing a Scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.
Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which is forbidden by Scrapy. How can I circumvent this?
I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.
Solution
I think using a spider middleware and overriding process_start_requests() would be a good start.
In your middleware, loop over all the start requests and use conditional statements to deal with the different types of URLs:
- For your special URLs, which do not require a download, you can either:
- directly call your pipeline's process_item(); do not forget to import your pipeline and create a scrapy.Item from your URL for this, or
- as you mentioned, pass the URL as meta in a Request, and have a separate parse function which would only return the URL.
- For all remaining URLs, you can launch a "normal" Request as you probably already have defined.
Answered By - Ruehri