Monday, July 25, 2022

[FIXED] Remove duplicate values using Scrapy

Labels: python, scrapy, web-scraping

Issue

The page lists 695 records, but my spider returns 954, so the output contains duplicate values. How can I remove the duplicates so that I get only the 695 records? This is the page link: http://www.palatakd.ru/list/

import scrapy
from scrapy.http import Request

class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.palatakd.ru/list/']
    page_number=1
   
    
    def parse(self, response):
        details=response.xpath("//p[@class='detail_block']")
        for detail in details:
            registration=detail.xpath(".//span[contains(.,'Регистрационный номер адвоката в реестре')]//following-sibling::span//text()").get()
            address=detail.xpath(".//span[contains(.,'Адрес')]//following-sibling::span//text()").get()
            phone=detail.xpath(".//span[contains(.,'Телефон')]//following-sibling::span//text()").get()
            fax=detail.xpath(".//span[contains(.,'Факс')]//following-sibling::span//text()").get()
            yield{
                'Телефон':phone,
                'Факс':fax,
                'Регистрационный номер адвоката в реестре':registration,
                'Адрес':address
            
            }
            next_page = 'http://www.palatakd.ru/list/?PAGEN_1=' + str(PushpaSpider.page_number)
            
            if PushpaSpider.page_number<=3:
                PushpaSpider.page_number += 1
                yield response.follow(next_page, callback = self.parse)

Solution

You can enable your item pipeline to filter out duplicates.

For example:

In your settings.py file, turn on (uncomment) the ITEM_PIPELINES setting:

ITEM_PIPELINES = {
   'project.pipelines.ProjectPipeline': 300,
}

In your pipelines.py file, filter out the duplicate items:

from scrapy.exceptions import DropItem

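class ProjectPipeline:
    # Items already seen during this crawl
    itemlist = []

    def process_item(self, item, spider):
        # Drop any item identical to one that was already yielded
        if item in self.itemlist:
            raise DropItem(f"Duplicate item: {item!r}")
        # First occurrence: remember it and pass it on
        self.itemlist.append(item)
        return item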
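
The list lookup in the pipeline above compares each new item against every stored item. A minimal sketch of a set-based variant follows, assuming every field value is hashable (strings or None, as in the spider from the question). The class name SetDedupePipeline is hypothetical, and the ImportError fallback is only there so the sketch can be run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch runs without Scrapy installed (assumption)
    class DropItem(Exception):
        pass


class SetDedupePipeline:
    """Hypothetical variant of ProjectPipeline that remembers items in a set."""

    def __init__(self):
        # Hashable keys of the items seen so far
        self.seen = set()

    def process_item(self, item, spider):
        # Build a hashable key from all fields, sorted by field name;
        # assumes every field value is hashable (str or None here)
        key = tuple(sorted(dict(item).items()))
        if key in self.seen:
            raise DropItem(f"Duplicate item: {item!r}")
        self.seen.add(key)
        return item
```

To try it, the class would be registered in ITEM_PIPELINES in place of ProjectPipeline.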

No adjustments need to be made to your spider.
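
The pipeline removes duplicates after they have been scraped. Separately, the spider in the question starts at /list/ and then follows ?PAGEN_1=1 first. If that URL returns the same records as the start page (an assumption about this site, not verified here), the first page is scraped twice. A small sketch of building the follow-up URLs from page 2 onward, with page_urls as a hypothetical helper name:

```python
BASE_URL = 'http://www.palatakd.ru/list/'


def page_urls(last_page):
    """Return the follow-up page URLs, starting at page 2.

    Assumes ?PAGEN_1=1 is the same page as BASE_URL, so page 1 is skipped.
    """
    return [f'{BASE_URL}?PAGEN_1={n}' for n in range(2, last_page + 1)]
```

One way to use it would be start_urls = [BASE_URL] + page_urls(3), which requests each page once and leaves parse to yield items only.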

Answered By - Alexander

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
