Wednesday, August 31, 2022

[FIXED] How to Define Scrapy Field Choices?

Tags: python, scrapy, web-crawler

Issue

Excuse the way I ask this question, but how can we add field choices (like Django's field choices) to a Scrapy Item, or how can we restrict a given field to a list of allowed values (e.g. a list of countries)?

I want to scrape data from a bunch of different websites, and I can clean the data fairly well, up to an extent. However, what I need is a way to force the Item class fields to accept only certain values, and to raise an error if a value is not in the list.

For example:

I have a field named SourceCountry = Field(). I know that I can declare Field(serializer=str) so that values are at least serialized as strings, which avoids other data types in the output.

Now, let's say that I have cleaned the scraped country data and formatted it into what I expect for a country. The value I'm storing is 'USA', and the list I want to use as field choices contains 'USA' too. Perfect! I can save this scraped data. On the other hand, if the value is, for example, 'glass', it obviously won't be in the list, and the Item should raise an error.

As far as I can imagine, I can just create a bunch of lists that I want to use as field choices and compare my result against them before storing it in the Item.

Is there a better solution? More professional?

So, I'm open to any suggestions.

Thanks.

Solution

You can subclass the scrapy.Item class and add some filtering methods that check for unwanted values.

For example:

items.py

from scrapy import Item, Field

class QuoteItem(Item):
    text = Field()
    source = Field()
    tags = Field()

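    def check_source(self, value):
        # Store the author only if it is one of the allowed choices;
        # otherwise store an empty string.
        if value not in ["J.K. Rowling", "Albert Einstein", "Dr. Seuss"]:
            self["source"] = ""
            return
        self["source"] = value

    def check_text(self, value):
        # No restriction on the quote text.
        self["text"] = value

    def check_tags(self, lst):
        # Blank out the tags of any quote tagged "religion".
        if "religion" in lst:
            self["tags"] = ""
            return
        self["tags"] = lst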

quotes.py

import scrapy

from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            item = QuoteItem()
            item.check_text(quote.xpath('./span[@class="text"]/text()').get())
            item.check_source(quote.xpath('.//small[@class="author"]/text()').get())
            item.check_tags(quote.xpath('.//a[@class="tag"]/text()').getall())
            yield item
        # Follow the pagination link only when there is a next page.
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
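
The check methods above blank out a disallowed value, whereas the question asks for an error. The following is a minimal sketch of that variant and is not part of the original answer. It assumes Scrapy 2.2 or later, which accepts dataclass objects as items, and the names StrictQuoteItem and ALLOWED_SOURCES are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical choice list; replace it with the real set of allowed values.
ALLOWED_SOURCES = frozenset({"J.K. Rowling", "Albert Einstein", "Dr. Seuss"})


@dataclass
class StrictQuoteItem:
    """Quote item that rejects a source outside ALLOWED_SOURCES."""

    text: str
    source: str
    tags: list = field(default_factory=list)

    def __post_init__(self):
        # Runs right after the generated __init__, so an item with a
        # disallowed source is never handed back to the caller.
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"source {self.source!r} is not an allowed choice")
```

In a spider, the construction could be wrapped in try/except ValueError so that a bad value is logged and skipped rather than stopping the callback. Note that __post_init__ only runs at construction time, so a later assignment such as item.source = "glass" would not be checked in this sketch.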

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
