Issue
I am very new to Python and Scrapy. My task is to download .PDF files on a specific topic. For example, there are many contracts on **https://www.sec.gov/** and currently I am downloading the files one by one. I have to write a Scrapy program that downloads all related .PDF files for a search keyword like **Exhibit 10 / EXHIBIT 11**.
## My Code ##
```python
#import urllib
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://www.sec.gov/cubist-pharmaceuticals-inc-exhibit-10-65-10-k405"]

    def parse(self, response):
        base_url = 'https://www.sec.gov/'
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            # self.logger.info(link)
            if link.endswith('.pdf'):
                #link = urllib.parse.urljoin(base_url, link)
                link = base_url + link
                self.logger.info(link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
Using this Scrapy code I am able to download the PDF only from the given URL, e.g. https://www.sec.gov/cubist-pharmaceuticals-inc-exhibit-10-65-10-k405 (with that URL the file downloads, but I could do that manually; I need to download every PDF that the search returns). If I search using the keyword Exhibit 10, the following page appears: https://secsearch.sec.gov/search?utf8=%3F&affiliate=secsearch&query=exhibit+10, and I want Scrapy to open all the result links and download all the PDFs. Can anyone help me fix this code? Thanks in advance.
Solution
You should first put the search query URL in start_urls. Then, from the response of the start URL, extract all the result URLs and send a request to each of them. After that, extract the PDF link from each result page and save the file to local storage.

The code will look something like this:
```python
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://secsearch.sec.gov/search?utf8=%E2%9C%93&affiliate=secsearch&sort_by=&query=Exhibit+10%2F+EXHIBIT+11"]

    def parse(self, response):
        # extract the search results and follow each one
        for link in response.xpath('//div[@id="results"]//h4[@class="title"]/a/@href').extract():
            req = Request(url=link, callback=self.parse_page)
            yield req

    def parse_page(self, response):
        # parse each search result here
        pdf_files = response.xpath('//div[@class="article-file-download"]/a/@href').extract()
        # the base url won't be part of these pdf_files
        # sample: [u'/files/18-03273-E.pdf']
        # it needs to be added at the beginning of each url;
        # response.urljoin() will do that for you
        for pdf in pdf_files:
            if pdf.endswith('.pdf'):
                pdf_url = response.urljoin(pdf)
                req = Request(url=pdf_url, callback=self.save_pdf)
                yield req

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
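One caveat with `save_pdf`: `response.url.split('/')[-1]` keeps any query string and may contain characters that are not valid in filenames. Here is a minimal sketch of a safer replacement for that method, assuming Python 3; the sanitizing rule and the `downloads` directory name are my assumptions, not part of the answer above:

```python
import os
import re

def save_pdf(self, response):
    # drop any query string, then keep only the last path segment
    name = response.url.split('?')[0].split('/')[-1]
    # replace characters that are unsafe in filenames (assumed rule)
    name = re.sub(r'[^\w.-]', '_', name) or 'download.pdf'
    os.makedirs('downloads', exist_ok=True)  # assumed output directory
    path = os.path.join('downloads', name)
    self.logger.info('Saving PDF %s', path)
    with open(path, 'wb') as f:
        f.write(response.body)
```

You can run the spider with `scrapy runspider your_spider_file.py` (the file name is a placeholder), or with `scrapy crawl pwc_tax` from inside a Scrapy project.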
Answered By - Jithin