Issue
I am very new to Python and Scrapy. My task is to download .PDF files on a specific topic. For example, there are many contracts on **https://www.sec.gov/** and currently I am downloading the files one by one. I have to write a Scrapy program that downloads all related .PDF files for a search keyword like **Exhibit 10 / EXHIBIT 11**.
## My Code ##
```python
#import urllib
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://www.sec.gov/cubist-pharmaceuticals-inc-exhibit-10-65-10-k405"]

    def parse(self, response):
        base_url = 'https://www.sec.gov/'
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            # self.logger.info(link)
            if link.endswith('.pdf'):
                #link = urllib.parse.urljoin(base_url, link)
                link = base_url + link
                self.logger.info(link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
Using this Scrapy code I am able to download the PDF only from the given URL, e.g. https://www.sec.gov/cubist-pharmaceuticals-inc-exhibit-10-65-10-k405 (with that URL the file downloads, but I could do that manually; I need to download every PDF that the search returns). If I search using the keyword Exhibit 10, the following page appears: https://secsearch.sec.gov/search?utf8=%3F&affiliate=secsearch&query=exhibit+10, and I want Scrapy to open all the result links and download all the PDFs. Can anyone help me fix this code? Thanks in advance.
Solution
You should first put the search query URL in start_urls. Then, from the response of the start URL, extract all the result URLs and send a request to each of them. After that, extract the PDF link from each result page and save the file to local storage.

The code will look something like this:
```python
import scrapy
from scrapy.http import Request


class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://secsearch.sec.gov/search?utf8=%E2%9C%93&affiliate=secsearch&sort_by=&query=Exhibit+10%2F+EXHIBIT+11"]

    def parse(self, response):
        # extract the search results and follow each one
        for link in response.xpath('//div[@id="results"]//h4[@class="title"]/a/@href').extract():
            req = Request(url=link, callback=self.parse_page)
            yield req

    def parse_page(self, response):
        # parse each search result here
        pdf_files = response.xpath('//div[@class="article-file-download"]/a/@href').extract()
        # the base url won't be part of these pdf_files
        # sample: [u'/files/18-03273-E.pdf']
        # it needs to be added at the beginning of each url;
        # response.urljoin() will do that for you
        for pdf in pdf_files:
            if pdf.endswith('.pdf'):
                pdf_url = response.urljoin(pdf)
                req = Request(url=pdf_url, callback=self.save_pdf)
                yield req

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
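One caveat with `save_pdf`: `response.url.split('/')[-1]` keeps any query string and may contain characters that are not valid in filenames. Here is a minimal sketch of a safer replacement for that method, assuming Python 3; the sanitizing rule and the `downloads` directory name are my assumptions, not part of the answer above:

```python
import os
import re

def save_pdf(self, response):
    # drop any query string, then keep only the last path segment
    name = response.url.split('?')[0].split('/')[-1]
    # replace characters that are unsafe in filenames (assumed rule)
    name = re.sub(r'[^\w.-]', '_', name) or 'download.pdf'
    os.makedirs('downloads', exist_ok=True)  # assumed output directory
    path = os.path.join('downloads', name)
    self.logger.info('Saving PDF %s', path)
    with open(path, 'wb') as f:
        f.write(response.body)
```

You can run the spider with `scrapy runspider your_spider_file.py` (the file name is a placeholder), or with `scrapy crawl pwc_tax` from inside a Scrapy project.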
Answered By - Jithin