Issue
The page lists 695 records, but my spider returns 954, so the output contains duplicates. How can I remove the duplicate values so that I get only the 695 records?
Here is the page link: http://www.palatakd.ru/list/
import scrapy
from scrapy.http import Request


class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://www.palatakd.ru/list/']
    page_number = 1

    def parse(self, response):
        details = response.xpath("//p[@class='detail_block']")
        for detail in details:
            registration = detail.xpath(".//span[contains(.,'Регистрационный номер адвоката в реестре')]//following-sibling::span//text()").get()
            address = detail.xpath(".//span[contains(.,'Адрес')]//following-sibling::span//text()").get()
            phone = detail.xpath(".//span[contains(.,'Телефон')]//following-sibling::span//text()").get()
            fax = detail.xpath(".//span[contains(.,'Факс')]//following-sibling::span//text()").get()
            yield {
                'Телефон': phone,
                'Факс': fax,
                'Регистрационный номер адвоката в реестре': registration,
                'Адрес': address
            }

        next_page = 'http://www.palatakd.ru/list/?PAGEN_1=' + str(PushpaSpider.page_number)
        if PushpaSpider.page_number <= 3:
            PushpaSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
Solution
You can enable an item pipeline to filter out the duplicates.
For example:
In your settings.py file, turn on (uncomment) the ITEM_PIPELINES setting:
ITEM_PIPELINES = {
    'project.pipelines.ProjectPipeline': 300,
}
Then, in your pipelines.py file, filter out the duplicate items:
from scrapy.exceptions import DropItem


class ProjectPipeline:
    itemlist = []

    def process_item(self, item, spider):
        if item in self.itemlist:
            raise DropItem
        self.itemlist.append(item)
        return item
No adjustments need to be made to your spider.
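A possible variation on the pipeline above, as a minimal sketch: it remembers a hashable key for each item in a set, so the membership check does not scan a growing list. The class name `DedupPipeline` and the choice of key (all fields of the item) are assumptions made for illustration, and the fallback `DropItem` class exists only so the sketch can run where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Assumption: stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DedupPipeline:  # hypothetical name, not from the original answer
    def __init__(self):
        # Keys of every item that has already passed through the pipeline.
        self.seen = set()

    def process_item(self, item, spider):
        # Build a hashable key from all fields; sorting by field name makes
        # the key independent of the order in which fields were set.
        key = tuple(sorted(dict(item).items()))
        if key in self.seen:
            raise DropItem(f"Duplicate item: {key}")
        self.seen.add(key)
        return item
```

To try it, the entry in ITEM_PIPELINES would point at 'project.pipelines.DedupPipeline' instead. This assumes every field value is hashable (strings or None, as in the spider above).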
Answered By - Alexander