Tuesday, January 9, 2024

[FIXED] Scraping specific pdfs from different websites

Tags: html, pdf-scraping, python, spyder, web-scraping

Issue

First question here. I need to download one specific PDF from each URL in my list: the European Commission proposal, which always appears in the same part of the page.

[Image: the part of the page that I always need as a PDF, the European Commission proposal]

And here is the HTML for that part. The link I need is "http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf", as you can see from the image:

 [<a class="externalDocument" href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="externalDocument">COM(2020)0791</a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
 <span class="ep_name">
 COM(2020)0791
                </span>
 <span class="ep_icon"> </span>
 </a>, <a href="http://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf" target="_blank">
 <span class="ep_name">
 COM(2020)0791
                </span>
<span class="ep_icon"> </span>
</a>]

I used the code below, which takes every URL from my CSV file and downloads every PDF on each page. The problem is that this approach also downloads PDFs that I do not need. I would not mind the extra downloads if I could tell which section of the page each file came from, which is why I am asking how to download only the PDFs from one specific subsection. Distinguishing them by section in the filename would also be fine.

At the moment this code gives me about 3000 PDFs, but I need around 1400, one for each link. It would also be easier for me if each file kept the name of its link, but that is not my main worry: the files are downloaded in the order the URLs appear in the CSV file, so they will be easy to tidy up afterwards.

In short, this code needs to download from only one part of each page instead of all of it:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
#import pandas

#data = pandas.read_csv('urls.csv')
#urls = data['urls'].tolist()

urls = ["http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350", "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2012/0299", "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"]
#url="http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092"


folder_location = r'C:\Users\myname\Documents\R\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='EN.pdf']"):
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)

For example, I did not want to download this follow-up document. Its name starts with COM and ends with EN.pdf, but it has a different year (2018 in this case) because it is a follow-up, as you can see from the link: https://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2018/0564/COM_COM(2018)0564_EN.pdf

Solution

The links in your HTML all seem to point to the same PDF [or at least they have the same filename], so the code will just download the same document repeatedly and overwrite it. Still, if you want to target only the first of those links, you could include the class externalDocument in your selector:

 for link in soup.select('a.externalDocument[href$="EN.pdf"]'):
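Since the three anchors in the question share one href, a small de-duplication step before downloading would avoid fetching the same file repeatedly. This is a minimal sketch, assuming the hrefs have already been collected into a list; unique_hrefs is a hypothetical helper name, not part of BeautifulSoup or requests:

```python
def unique_hrefs(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


pdf = ("http://www.europarl.europa.eu/RegData/docs_autres_institutions/"
       "commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf")
# the question's HTML contains three anchors pointing at the same file
hrefs = [pdf, pdf, pdf]
deduped = unique_hrefs(hrefs)
```

In the download loop, this could wrap the hrefs taken from whichever selector you use, for example unique_hrefs(link['href'] for link in soup.select(...)).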

If you want to target a specific event like 'Legislative proposal published', then you could do something like this:

# imports, urls ... os.mkdir(folder_location) as in your code

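evtName = 'Legislative proposal published'

# building blocks: a table cell, a name span, and a link ending in EN.pdf
tdSel, spSel, aSel = 'div.ep-table-cell', 'span.ep_name', 'a[href$="EN.pdf"]'
# the PDF link, inside a cell that is preceded by at least two sibling cells
dlSel = f'{tdSel}+{tdSel}+{tdSel} {spSel}>{aSel}'
# rows containing both such a PDF link and a name span in a cell preceded by another cell
trSel = f'div.ep-table-row:has(>{dlSel}):has(>{tdSel}+{tdSel} {spSel})'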

for url in urls:
    response = requests.get(url)
    soup= BeautifulSoup(response.text, "html.parser")

    pgPdfLinks = [
        tr.select_one(dlSel).get('href') for tr in soup.select(trSel) if 
        evtName.strip().lower() in 
        tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text().strip().lower()
        ## if you want [case sensitive] exact match, change condition to
        # tr.select_one(f'{tdSel}+{tdSel} {spSel}').get_text() == evtName
    ]     
    for link in pgPdfLinks[:1]:
        filename = os.path.join(folder_location, link.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)

[The [:1] of pgPdfLinks[:1] is probably unnecessary since more than one match isn't expected, but it's there if you want to absolutely ensure only one download per page.]

Note: you need to be sure that there will be an event named evtName with a link matching aSel (a[href$="EN.pdf"] in this case) - otherwise, no PDF links will be found and nothing will be downloaded for those pages.
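One way to act on that note is to record which pages produced no match, so they can be checked by hand afterwards. The following is a rough sketch, assuming the per-page link lists (like pgPdfLinks above) have been gathered into a dict keyed by page URL; split_found_and_missing and links_by_url are hypothetical names used only for illustration:

```python
def split_found_and_missing(links_by_url):
    """Separate pages that yielded a PDF link from pages that yielded none.

    links_by_url maps each page URL to the list of PDF links found on it.
    Returns (found, missing): found maps page URL -> first link,
    missing lists the page URLs whose list was empty.
    """
    found, missing = {}, []
    for url, links in links_by_url.items():
        if links:
            found[url] = links[0]  # same effect as pgPdfLinks[:1]
        else:
            missing.append(url)
    return found, missing


# illustrative input: one page with a match, one without
example = {
    "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350": [
        "http://www.europarl.europa.eu/RegData/docs_autres_institutions/"
        "commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf"
    ],
    "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2013/0092": [],
}
found, missing = split_found_and_missing(example)
```

With around 1400 pages, writing the missing list to a text file would make it easy to see which procedures need a different event name or selector.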

"if it keeps the name of the link it could be also easier for me"

It's already doing that in your code, since there doesn't seem to be much difference between link['href'].split('/')[-1] and link.get_text().strip(), but if you meant that you want the page link [i.e. the url], you could include the procnum (since that seems to be an identifying part of url) in your filename:

    # for link in...
        procnum = url.replace('?', '&').split('&procnum=')[-1].split('&')[0]
        procnum = ''.join(c if (
            c.isalpha() or c.isdigit() or c in '_-[]'
        ) else ('_' if c == '/' else '') for c in procnum)
        filename = f"proc-{procnum} {link.split('/')[-1]}"
        # filename = f"proc-{procnum} {link['href'].split('/')[-1]}" # in your current code

        filename = os.path.join(folder_location, filename)
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link)).content)
            # f.write(requests.get(urljoin(url, link['href'])).content) # in your current code

So, [for example] instead of saving to "COM_COM(2020)0791_EN.pdf", it will save to "proc-OLP_2020_0350 COM_COM(2020)0791_EN.pdf".
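As an alternative to the chained replace/split calls, the procnum could be read with the standard library's URL parser. This is only a sketch under the assumption that every page URL carries a procnum query parameter as in the question; procnum_from_url and proc_filename are hypothetical helper names:

```python
from urllib.parse import parse_qs, urlsplit


def procnum_from_url(url):
    """Return a filename-safe procnum taken from url, or '' if there is none."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("procnum", [""])[0]
    # keep letters, digits and '_-[]'; turn '/' into '_'; drop everything else
    return "".join(
        c if (c.isalpha() or c.isdigit() or c in "_-[]")
        else ("_" if c == "/" else "")
        for c in raw
    )


def proc_filename(url, pdf_href):
    """Build 'proc-<procnum> <pdf name>' from a page URL and a PDF link."""
    return f"proc-{procnum_from_url(url)} {pdf_href.split('/')[-1]}"


page = "http://www.europarl.europa.eu/oeil/FindByProcnum.do?lang=en&procnum=OLP/2020/0350"
pdf = ("http://www.europarl.europa.eu/RegData/docs_autres_institutions/"
       "commission_europeenne/com/2020/0791/COM_COM(2020)0791_EN.pdf")
name = proc_filename(page, pdf)
```

The result would then be passed to os.path.join(folder_location, name) exactly as in the loop above.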

Answered By - Driftr95

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
