Issue
I understand that in Scrapy we can define custom Items or just return a plain Python dict. In the Scrapy documentation there's a specific page for Item Loaders, which says:
Item Loaders provide a convenient mechanism for populating scraped items. Even though items can be populated directly, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
Also, in the next section, which explains Item Pipelines, there's an example that uses Item Adapters to clean up the price:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class PricePipeline:
    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get("price"):
            if adapter.get("price_excludes_vat"):
                adapter["price"] = adapter["price"] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")
Why didn't they just use Item Loaders and declare a processor to clean up the price, or override the serializer method to do so?
I just don't understand the difference between Item Loaders and Item Adapters. I also can't find good documentation for either of them, or any blog post or Stack Overflow question that distinguishes the two.
Solution
They are a bit confusing, I agree. But they have different purposes:
Item loaders give you an API for stating, more or less declaratively, how each field of your item is extracted from the response and cleaned before it is assigned. So they are essentially builders for your items. They improve code readability and give you helpers (input and output processors) for extracting data.
Item adapters give you an API for accessing data in a container object in a standardized way. Whether you are using Item classes, dataclasses, or plain dictionaries, an adapter gives you a single, dict-like way of reading and writing fields. They are more like wrappers. A common scenario is a code base whose spiders parse responses into different types of objects, where you don't want to handle each type separately in your item pipelines. With adapters, your item pipelines can ignore what type of object the spiders returned.
Unfortunately, there is not much documentation on either of them (you can probably find more about loaders, though I'm not sure). However, they are separate components because they solve different problems, and you may want to use both, one, or neither of them.
Hope this helps :)
Answered By - Leandro Rodrigues de Souza