Issue

    from scrapy.spiders import Spider
    from ..items import QtItem

    class QuoteSpider(Spider):
        name = 'acres'
        start_urls = ['any_url']

        def parse(self, response):
            items = QtItem()
            all_div_names = response.xpath('//article')
            for bks in all_div_names:
                name = all_div_names.xpath('//span[@class="css-fwbz9r"]/text()').extract()
                price = all_div_names.xpath('//h2[@class="css-yr18fa"]/text()').extract()
                sqft = all_div_names.xpath('//div[@class="css-1ty8tu4"]/text()').extract()
                bhk = all_div_names.xpath('//a[@class="css-163eyf0"]/text()').extract()
            yield {
                'ttname': name,
                'ttprice': price,
                'ttsqft': sqft,
                'ttbhk': bhk
            }
Solution
Corrections

- Use `.//` instead of `//` in each XPath expression inside the loop.
- Use `bks` (the loop variable) instead of `all_div_names`.
- Use `get()` instead of `extract()`, since each span contains a single item: `get()` returns one item, while `extract()` returns a list of every match.
- Your yield statement is not within the for loop. To yield each listing as its own dictionary, the yield statement needs to be inside the for loop.

e.g. `name = bks.xpath('.//span[@class="css-fwbz9r"]/text()').get()`
Tips

- `.//` traverses the child elements of the selector you are looping over. Always use `.//` when looping over an XPath selector that matched multiple items, such as `all_div_names`; e.g. `name = bks.xpath('.//span[@class="css-fwbz9r"]/text()').get()` selects only the span elements inside `bks`.
- Use `getall()` instead of `extract()` and `get()` instead of `extract_first()`. With `get()` you always get a single string (or None if nothing matched); with `extract()` you won't know whether you're getting a list or a string, unfortunately.
- Use an Item class rather than yielding a plain dictionary; it makes things like pipelines easier. A pipeline modifies the scraped data, e.g. changing which items end up in a JSON output file. A common example is a duplicates pipeline (there is one in the Scrapy docs) that drops an item when it repeats data you have already scraped. I almost never yield a plain dictionary in a scraping project unless the data is highly structured, requires no modification, or contains no duplicate information.
- Consider using Scrapy's ItemLoaders for any scraping project where the extracted data needs simple modification, e.g. clearing newlines or lightly reshaping values. You'll be surprised how often that is.
Code Example

    def parse(self, response):
        items = QtItem()
        all_div_names = response.xpath('//article')
        for bks in all_div_names:
            name = bks.xpath('.//span[@class="css-fwbz9r"]/text()').get()
            price = bks.xpath('.//h2[@class="css-yr18fa"]/text()').get()
            sqft = bks.xpath('.//div[@class="css-1ty8tu4"]/text()').get()
            bhk = bks.xpath('.//a[@class="css-163eyf0"]/text()').get()
            yield {
                'ttname': name,
                'ttprice': price,
                'ttsqft': sqft,
                'ttbhk': bhk
            }
Answered By - AaronS