Issue
I have a problem with my script such that the same file name, and pdf is downloading. I have checked the output of my results without downloadfile and I get unique data. It's when I use the pipeline that it somehow produces duplicates for download.
Here's my script:
import scrapy
from environment.items import fcpItem
class fscSpider(scrapy.Spider):
name = 'fsc'
start_urls = ['https://fsc.org/en/members']
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url,
callback = self.parse
)
def parse(self, response):
content = response.xpath("(//div[@class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[@class='content__item even field__item'])[position() >1]")
loader = fcpItem()
names_add = response.xpath(".//div[@class = 'field__item resource-item']/article//span[@class='media-caption file-caption']/text()").getall()
url = response.xpath(".//div[@class = 'field__item resource-item']/article/div[@class='actions']/a//@href").getall()
pdf=[response.urljoin(x) for x in url if '#' is not x]
names = [x.split(' ')[0] for x in names_add]
for nm, pd in zip(names, pdf):
loader['names'] = nm
loader['pdfs'] = [pd]
yield loader
items.py
class fcpItem(scrapy.Item):
names = Field()
pdfs = Field()
results = Field()
pipelines.py
class DownfilesPipeline(FilesPipeline):
def file_path(self, request, response=None, info=None, item=None):
items = item['names']+'.pdf'
return items
settings.py
from pathlib import Path
import os
BASE_DIR = Path(__file__).resolve().parent.parent
FILES_STORE = os.path.join(BASE_DIR, 'fsc')
ROBOTSTXT_OBEY = False
FILES_URLS_FIELD = 'pdfs'
FILES_RESULT_FIELD = 'results'
ITEM_PIPELINES = {
'environment.pipelines.pipelines.DownfilesPipeline': 150
}
Solution
The problem is that you are overwriting the same scrapy item every iteration.
What you need to do is create a new item for each time your parse method yields. I have tested this and confirmed that it does produce the results you desire.
I made and inline not in my example below on the line that needs to be changed.
For example:
import scrapy
from environment.items import fcpItem
class fscSpider(scrapy.Spider):
name = 'fsc'
start_urls = ['https://fsc.org/en/members']
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url,
callback = self.parse
)
def parse(self, response):
content = response.xpath("(//div[@class='content__wrapper field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items']/div[@class='content__item even field__item'])[position() >1]")
names_add = response.xpath(".//div[@class = 'field__item resource-item']/article//span[@class='media-caption file-caption']/text()").getall()
url = response.xpath(".//div[@class = 'field__item resource-item']/article/div[@class='actions']/a//@href").getall()
pdf=[response.urljoin(x) for x in url if '#' is not x]
names = [x.split(' ')[0] for x in names_add]
for nm, pd in zip(names, pdf):
loader = fcpItem() # Here you create a new item each iteration
loader['names'] = nm
loader['pdfs'] = [pd]
yield loader
Answered By - Alexander
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.