Monday, December 6, 2021

[FIXED] scrapy to get into next page and download all files

Labels: python, scrapy, scrapy-spider, web-crawler, web-scraping

Issue

I am new to Scrapy and Python. I am able to get details from the URL, but I want to follow each link and download all the files (.htm and .txt).

My Code

import scrapy

class legco(scrapy.Spider):
    name = "sec_gov"

    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    def parse(self, response):
        for link in response.xpath('//table[@summary="Results"]//td[@scope="row"]/a/@href').extract():
            absoluteLink = response.urljoin(link)
            yield scrapy.Request(url = absoluteLink, callback = self.parse_page)

    def parse_page(self, response):
        for links in response.xpath('//table[@summary="Results"]//a[@id="documentsbutton"]/@href').extract():
            targetLink = response.urljoin(links)
            yield {"links":targetLink}

I need to follow each link and download all the files that end with .htm or .txt. The code below is not working:

if link.endswith('.htm'):
    link = urlparse.urljoin(base_url, link)
    req = Request(link, callback=self.save_pdf)
    yield req

def save_pdf(self, response):
    path = response.url.split('/')[-1]
    with open(path, 'wb') as f:
        f.write(response.body)

Can anyone help me with this? Thanks in advance.

Solution

Try the following to download the files to a folder on your desktop, or to whichever location you set within the script:

import os

import scrapy

class legco(scrapy.Spider):
    name = "sec_gov"

    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    def parse(self, response):
        # Step 1: follow each company link in the results table.
        for link in response.xpath('//table[@summary="Results"]//td[@scope="row"]/a/@href').extract():
            absoluteLink = response.urljoin(link)
            yield scrapy.Request(url = absoluteLink, callback = self.parse_links)

    def parse_links(self, response):
        # Step 2: follow each "Documents" button on the company's filing list.
        for links in response.xpath('//table[@summary="Results"]//a[@id="documentsbutton"]/@href').extract():
            targetLink = response.urljoin(links)
            yield scrapy.Request(url = targetLink, callback = self.collecting_file_links)

    def collecting_file_links(self, response):
        # Step 3: request only the .htm and .txt files listed in the document tables.
        for links in response.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href').extract():
            if links.endswith((".htm", ".txt")):
                baseLink = response.urljoin(links)
                yield scrapy.Request(url = baseLink, callback = self.download_files)

    def download_files(self, response):
        # Step 4: save the response body under the last segment of its URL.
        path = response.url.split('/')[-1]
        dirf = r"C:\Users\WCS\Desktop\Storage"
        # Create the target folder if it does not exist yet.
        os.makedirs(dirf, exist_ok=True)
        # Write into the folder directly instead of changing the working directory.
        with open(os.path.join(dirf, path), 'wb') as f:
            f.write(response.body)

To be clearer: you need to set dirf explicitly, for example dirf = r"C:\Users\WCS\Desktop\Storage", replacing C:\Users\WCS\Desktop with your desired location. The script automatically creates the Storage folder if it does not exist and saves the files in it.
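
The link filtering and file naming can also be tried on their own, without running a crawl. The following is only a minimal sketch using the standard library: the helper names is_wanted and target_path, the example.com URLs, and the relative "Storage" folder are all made up for illustration and are not part of Scrapy or of the answer above. Unlike the spider, this sketch checks only the path part of the link and ignores case, which is an assumption about what you may want.

```python
import os
from urllib.parse import urljoin, urlparse

# Hypothetical helpers for illustration; not part of Scrapy or the answer above.
WANTED_SUFFIXES = (".htm", ".txt")

def is_wanted(href):
    """Return True when the link's path ends in .htm or .txt (case-insensitive)."""
    # Look only at the path so a query string or fragment does not hide the extension.
    return urlparse(href).path.lower().endswith(WANTED_SUFFIXES)

def target_path(url, directory):
    """Build the local path for a downloaded URL inside the given directory."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(directory, filename)

# Made-up page URL and links, standing in for what a document page might contain.
page = "https://example.com/Archives/data/0001/index.htm"
links = ["/Archives/data/0001/report.htm", "notes.txt", "image.jpg"]

# Keep only the wanted links and resolve them against the page URL.
absolute = [urljoin(page, link) for link in links if is_wanted(link)]

for url in absolute:
    print(url, "->", target_path(url, "Storage"))
```

Building the full path with os.path.join keeps the process's working directory unchanged, which tends to matter if anything else in the same process writes files with relative paths.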

Answered By - SIM

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
