Monday, December 25, 2023

[FIXED] Python download all links in a dataframe column

December 25, 2023 beautifulsoup, download, pdf, python, python-requests

Issue

Here is an example of my dataframe:

id  pdf
1   https://ia802902.us.archive.org/10/items/EL103_L_1978_03_024_01_1_PF_03/EL103_L_1978_03_024_01_1_PF_03.pdf
2   https://ia801900.us.archive.org/31/items/EL103_L_1978_03_033_07_1_PF_05/EL103_L_1978_03_033_07_1_PF_05.pdf
3   https://ia802900.us.archive.org/35/items/EL105_L_1978_03_072_03_1_PF_05/EL105_L_1978_03_072_03_1_PF_05.pdf

I want to download each PDF listed in the "pdf" column. I tried the following code (source: https://www.geeksforgeeks.org/downloading-pdfs-with-python-using-requests-and-beautifulsoup/):

import requests
from bs4 import BeautifulSoup

for url in df["pdf"]:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a')
    i = 0
    for link in links:
        if ('.pdf' in link.get('href', [])):
            i += 1
            print("Downloading file: ", i)
            
            response = requests.get(link.get('href'))
            pdf = open("C:/myfolder"+str(i)+".pdf", 'wb')
            pdf.write(response.content)
            pdf.close()
            print("File ", i, " downloaded")

The code runs, but it does not download any files. I would also like to keep the original name of each PDF (for example: EL103_L_1978_03_024_01_1_PF_03.pdf). Any suggestions?

Solution

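The URLs in the dataframe already point directly at the PDF files, so parsing each response as HTML is unlikely to yield any <a> tags, which is why nothing gets downloaded. Instead, request each URL directly and use the last segment of the URL as the file name: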

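import requests

for pdf_url in df["pdf"]:
    # The last path segment of the URL is the original file name.
    file_name = pdf_url.split("/")[-1]
    with open(file_name, "wb") as f_out:
        print("Downloading", pdf_url)
        # Fetch the PDF and write its raw bytes to disk.
        f_out.write(requests.get(pdf_url).content)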

Prints:

Downloading https://ia802902.us.archive.org/10/items/EL103_L_1978_03_024_01_1_PF_03/EL103_L_1978_03_024_01_1_PF_03.pdf
Downloading https://ia801900.us.archive.org/31/items/EL103_L_1978_03_033_07_1_PF_05/EL103_L_1978_03_033_07_1_PF_05.pdf
Downloading https://ia802900.us.archive.org/35/items/EL105_L_1978_03_072_03_1_PF_05/EL105_L_1978_03_072_03_1_PF_05.pdf

and saves them as:

andrej@MyPC:~/app$ ls -alF *pdf
-rw-r--r-- 1 root root 792942 sep 10 22:54 EL103_L_1978_03_024_01_1_PF_03.pdf
-rw-r--r-- 1 root root 559170 sep 10 22:54 EL103_L_1978_03_033_07_1_PF_05.pdf
-rw-r--r-- 1 root root 935443 sep 10 22:54 EL105_L_1978_03_072_03_1_PF_05.pdf
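
For longer lists of URLs, the same idea can be made a bit more defensive. The following is only a sketch, not part of the original answer: the helper names filename_from_url and download_pdfs, the "pdfs" target folder, the 30-second timeout and the 8192-byte chunk size are all assumptions chosen for illustration. It streams each file to disk, skips URLs that fail, and derives the file name from the URL path so that a query string or fragment does not end up in the name.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse

import requests


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    return PurePosixPath(unquote(urlparse(url).path)).name


def download_pdfs(urls, target_dir="pdfs", timeout=30):
    """Download each URL into target_dir, keeping the original file name.

    Returns the list of paths that were written successfully.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        out_path = target / filename_from_url(url)
        try:
            # stream=True avoids holding the whole PDF in memory at once.
            with requests.get(url, stream=True, timeout=timeout) as response:
                # Raise for 4xx/5xx responses before creating the file.
                response.raise_for_status()
                with open(out_path, "wb") as f_out:
                    for chunk in response.iter_content(chunk_size=8192):
                        f_out.write(chunk)
        except requests.RequestException as exc:
            # Report the failure and move on to the next URL.
            print("Failed to download", url, "-", exc)
            continue
        saved.append(out_path)
    return saved
```

With the dataframe from the question, this would be called as download_pdfs(df["pdf"]). Note that a connection dropped mid-transfer can still leave a partial file behind, so checking the returned list against the input is a reasonable follow-up.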

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
