Issue
Excuse the way I ask this question, but how can we add field choices (e.g. Django field choices), or restrict a given field to a list of keywords (e.g. a list of countries)?
I want to scrape data from a bunch of different websites, and I can clean the data fairly well, to an extent. However, what I need is a way to force the Item class fields to accept only certain values and raise an error if the value is not in the list.
For example:
I have a field named SourceCountry = Field(). I know that I can set a rule to accept only string values with Field(serializer=str). So now I can at least avoid values of other data types.
Now, let's say that I cleaned the scraped country data and formatted it into what I expect for country data. The value I'm storing is 'USA', and the list that I want to use as field choices contains 'USA' too. Perfect! I can save this scraped data. On the other hand, if the value is, for example, 'glass', it obviously won't be in the list and the Item should raise an error.
As far as I can imagine, I can just create a bunch of lists to use as field choices and compare my result against them before storing it in the Item.
Is there a better solution? More professional?
So, I'm open to any suggestions.
Thanks.
Solution
You can subclass the scrapy.Item class and add some filtering methods that check for unwanted values.
For example:
items.py
from scrapy import Item, Field


class QuoteItem(Item):
    text = Field()
    source = Field()
    tags = Field()

    def check_source(self, value):
        # Store an empty string unless the author is one of the allowed names.
        if value not in ["J.K. Rowling", "Albert Einstein", "Dr. Seuss"]:
            self["source"] = ""
            return
        self["source"] = value

    def check_text(self, value):
        self["text"] = value

    def check_tags(self, lst):
        # Store an empty string when the unwanted "religion" tag is present.
        if "religion" in lst:
            self["tags"] = ""
            return
        self["tags"] = lst
quotes.py
import scrapy

from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            item = QuoteItem()
            item.check_text(quote.xpath('./span[@class="text"]/text()').get())
            item.check_source(quote.xpath('.//small[@class="author"]/text()').get())
            item.check_tags(quote.xpath('.//a[@class="tag"]/text()').getall())
            yield item
        # Follow the pagination link, if there is one.
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
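The methods above store an empty value when a check fails, whereas the question asked for an error. A minimal sketch of that variant follows, in plain Python so it runs without Scrapy. The names ALLOWED_CHOICES and validate_choices are hypothetical and not part of Scrapy, and the country list is made up for illustration. The function accepts any mapping, so a Scrapy item, which is dict-like, could presumably be passed to it before being yielded.

```python
from collections.abc import Mapping

# Hypothetical choices table: field name -> allowed values (illustrative only).
ALLOWED_CHOICES = {
    "SourceCountry": frozenset({"USA", "Canada", "Germany"}),
}


def validate_choices(item: Mapping, allowed: Mapping = ALLOWED_CHOICES) -> None:
    """Raise ValueError if a populated field holds a value outside its choices."""
    for field, choices in allowed.items():
        # Fields that are missing from the item are skipped, not rejected.
        if field in item and item[field] not in choices:
            raise ValueError(
                f"{item[field]!r} is not a valid choice for field {field!r}"
            )


# An accepted value passes silently.
validate_choices({"SourceCountry": "USA"})

# A value outside the list raises ValueError.
try:
    validate_choices({"SourceCountry": "glass"})
except ValueError as exc:
    print(exc)
```

Keeping the allowed values in one table, rather than hard-coding them inside each check method, means a new field only needs a new entry. The sketch assumes the checked values are hashable, such as strings.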
Answered By - Alexander