Issue
The page has 10 quotes; when I put them into a list it shows all 10.
But when I run the code to scrape it, one quote is missing from the output, so there are only 9 rows of data.
(Note: the missing quote is one whose author also has another quote on the page; not sure if that has anything to do with it.)
Page being scraped: https://quotes.toscrape.com/page/4
The same happens with other pages.
I have 2 functions: one scrapes the quote URLs and some basic info about each quote, then follows those URLs to scrape data about the author and builds a dict there.
code :
def parse(self, response):
    qs = response.css('.quote')
    for q in qs:
        n = {}
        page_url = q.css('span a').attrib['href']
        full_page_url = 'https://quotes.toscrape.com' + page_url
        # tags
        t = []
        tags = q.css('.tag')
        for tag in tags:
            t.append(tag.css('::text').get())
        # items
        n['quote'] = q.css('.text ::text').get(),
        n['tag'] = t,
        n['author'] = q.css('span .author ::text').get(),
        yield response.follow(full_page_url, callback=self.parse_page, meta={'item': n})
def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    yield {
        'text': item['quote'],
        'author': item['author'],
        'tags': item['tag'],
        'date': q.css('p .author-born-date ::text').get(),
        'location': q.css('p .author-born-location ::text').get(),
    }
I also tried using Items (Scrapy Fields); same thing.
I also tried debugging and printing the data from the first function: the missing row shows up there, but it doesn't get sent to the second function.
So I tried different methods of sending the dict with the first function's info to the second one. I tried cb_kwargs: yield response.follow(full_page_url, callback=self.parse_page, cb_kwargs={'item': n})
Solution
Scrapy has a built-in duplicate filter that automatically ignores duplicate URLs. When you have two quotes by the same author, both of those quotes target the same URL for the author details, so when the filter reaches the second occurrence of that URL it ignores the request, and that item is never yielded to the output feed processors.
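To make the behavior concrete, here is a minimal sketch of what a URL-level duplicate filter does. This is not Scrapy's actual dupefilter (which fingerprints the whole request, not just the URL), but it shows why the second request to the same author page is dropped while dont_filter bypasses the check:

```python
# Minimal sketch of duplicate-filter behavior (NOT Scrapy's real
# implementation, which hashes a fingerprint of the full request).
seen = set()

def schedule(url, dont_filter=False):
    """Return True if the request would be scheduled, False if filtered."""
    if dont_filter:
        return True          # bypass the filter entirely
    if url in seen:
        return False         # duplicate: request is silently dropped
    seen.add(url)
    return True

# Two quotes by the same author follow the same author-details URL:
author_url = "https://quotes.toscrape.com/author/Albert-Einstein"
print(schedule(author_url))                    # True: first request goes through
print(schedule(author_url))                    # False: duplicate is filtered
print(schedule(author_url, dont_filter=True))  # True: dont_filter bypasses the filter
```

That dropped second request is exactly the "missing" row: its callback never runs, so its item is never yielded.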
You can fix this by setting the dont_filter parameter to True in your requests.
For example:
def parse(self, response):
    for q in response.css('.quote'):
        n = {}
        n["tags"] = q.css('.tag::text').getall()
        n['quote'] = q.css('.text ::text').get().strip()
        n['author'] = q.css('span .author ::text').get().strip()
        page_url = q.css('span a').attrib['href']
        yield response.follow(page_url, callback=self.parse_page, meta={'item': n}, dont_filter=True)

def parse_page(self, response):
    q = response.css('.author-details')
    item = response.meta.get('item')
    item["date"] = q.css('p .author-born-date ::text').get()
    item["location"] = q.css('p .author-born-location ::text').get()
    yield item
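Your cb_kwargs attempt was also on the right track; it failed for the same dupefilter reason, not because of how the dict was passed. A sketch of the same fix using cb_kwargs instead of meta (assuming the same spider class; with cb_kwargs, the dict arrives as a keyword argument to the callback):

```python
# Sketch: same fix, passing the item via cb_kwargs instead of meta.
def parse(self, response):
    for q in response.css('.quote'):
        n = {
            'quote': q.css('.text ::text').get().strip(),
            'author': q.css('span .author ::text').get().strip(),
            'tags': q.css('.tag::text').getall(),
        }
        page_url = q.css('span a').attrib['href']
        # dont_filter=True is still required: cb_kwargs changes how the
        # item travels, not how the dupefilter fingerprints the request.
        yield response.follow(page_url, callback=self.parse_page,
                              cb_kwargs={'item': n}, dont_filter=True)

def parse_page(self, response, item):
    q = response.css('.author-details')
    item['date'] = q.css('p .author-born-date ::text').get()
    item['location'] = q.css('p .author-born-location ::text').get()
    yield item
```

Either approach works; cb_kwargs is generally preferred over meta in current Scrapy because the callback's signature documents what it expects.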
Answered By - Alexander