Issue
The page has 10 quotes; when I put them into a list it shows all 10.
But when I run the code to scrape it, one quote is missing from the output, so there are only 9 rows of data.
(Note: the missing quote is one whose author also has another quote on the page; not sure if that has anything to do with it.)
Page being scraped: https://quotes.toscrape.com/page/4
The same happens with other pages.
I have 2 functions: one scrapes the quote URLs and some basic info about each quote, then follows those URLs to scrape data about the author and builds a dict there.
code :
def parse(self, response):
    qs = response.css('.quote')
    for q in qs:
        n = {}
        page_url = q.css('span a').attrib['href']
        full_page_url = 'https://quotes.toscrape.com' + page_url
        # tags
        t = []
        tags = q.css('.tag')
        for tag in tags:
            t.append(tag.css('::text').get())
        # items
        n['quote'] = q.css('.text ::text').get(),
        n['tag'] = t,
        n['author'] = q.css('span .author ::text').get(),
        yield response.follow(full_page_url, callback=self.parse_page, meta={'item': n})
def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    yield {
        'text': item['quote'],
        'author': item['author'],
        'tags': item['tag'],
        'date': q.css('p .author-born-date ::text').get(),
        'location': q.css('p .author-born-location ::text').get(),
    }
I also tried using Items (Scrapy Fields); same thing.
I also tried debugging and printing the data from the first function: the missing row shows up there, but it doesn't get sent to the second function.
So I tried different methods of sending the dict with the first function's info to the second one. I tried cb_kwargs: yield response.follow(full_page_url, callback=self.parse_page, cb_kwargs={'item': n})
Solution
Scrapy has a built-in duplicate filter that automatically ignores duplicate URLs. When you have two quotes by the same author, both of those quotes target the same URL for the author details, so when the filter reaches the second occurrence of that URL it ignores the request, and that item is never yielded to the output feed processors.
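To make the behavior concrete, here is a minimal sketch of what a URL-level duplicate filter does. This is not Scrapy's actual dupefilter (which fingerprints the whole request, not just the URL), but it shows why the second request to the same author page is dropped while dont_filter bypasses the check:

```python
# Minimal sketch of duplicate-filter behavior (NOT Scrapy's real
# implementation, which hashes a fingerprint of the full request).
seen = set()

def schedule(url, dont_filter=False):
    """Return True if the request would be scheduled, False if filtered."""
    if dont_filter:
        return True          # bypass the filter entirely
    if url in seen:
        return False         # duplicate: request is silently dropped
    seen.add(url)
    return True

# Two quotes by the same author follow the same author-details URL:
author_url = "https://quotes.toscrape.com/author/Albert-Einstein"
print(schedule(author_url))                    # True: first request goes through
print(schedule(author_url))                    # False: duplicate is filtered
print(schedule(author_url, dont_filter=True))  # True: dont_filter bypasses the filter
```

That dropped second request is exactly the "missing" row: its callback never runs, so its item is never yielded.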
You can fix this by setting the dont_filter parameter to True in your requests.
For example:
def parse(self, response):
    for q in response.css('.quote'):
        n = {}
        n["tags"] = q.css('.tag::text').getall()
        n['quote'] = q.css('.text ::text').get().strip()
        n['author'] = q.css('span .author ::text').get().strip()
        page_url = q.css('span a').attrib['href']
        yield response.follow(page_url, callback=self.parse_page, meta={'item': n}, dont_filter=True)

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    item["date"] = q.css('p .author-born-date ::text').get()
    item["location"] = q.css('p .author-born-location ::text').get()
    yield item
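Your cb_kwargs attempt was also on the right track; it failed for the same dupefilter reason, not because of how the dict was passed. A sketch of the same fix using cb_kwargs instead of meta (assuming the same spider class; with cb_kwargs, the dict arrives as a keyword argument to the callback):

```python
# Sketch: same fix, passing the item via cb_kwargs instead of meta.
def parse(self, response):
    for q in response.css('.quote'):
        n = {
            'quote': q.css('.text ::text').get().strip(),
            'author': q.css('span .author ::text').get().strip(),
            'tags': q.css('.tag::text').getall(),
        }
        page_url = q.css('span a').attrib['href']
        # dont_filter=True is still required: cb_kwargs changes how the
        # item travels, not how the dupefilter fingerprints the request.
        yield response.follow(page_url, callback=self.parse_page,
                              cb_kwargs={'item': n}, dont_filter=True)

def parse_page(self, response, item):
    q = response.css('.author-details')
    item['date'] = q.css('p .author-born-date ::text').get()
    item['location'] = q.css('p .author-born-location ::text').get()
    yield item
```

Either approach works; cb_kwargs is generally preferred over meta in current Scrapy because the callback's signature documents what it expects.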
Answered By - Alexander