Issue
Here is an example of my dataframe:
id pdf
1 https://ia802902.us.archive.org/10/items/EL103_L_1978_03_024_01_1_PF_03/EL103_L_1978_03_024_01_1_PF_03.pdf
2 https://ia801900.us.archive.org/31/items/EL103_L_1978_03_033_07_1_PF_05/EL103_L_1978_03_033_07_1_PF_05.pdf
3 https://ia802900.us.archive.org/35/items/EL105_L_1978_03_072_03_1_PF_05/EL105_L_1978_03_072_03_1_PF_05.pdf
I want to download every PDF listed in the 'pdf' column. I tried the following code (source: https://www.geeksforgeeks.org/downloading-pdfs-with-python-using-requests-and-beautifulsoup/):
import requests
from bs4 import BeautifulSoup

for url in df["pdf"]:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a')
    i = 0
    for link in links:
        if ('.pdf' in link.get('href', [])):
            i += 1
            print("Downloading file: ", i)
            response = requests.get(link.get('href'))
            pdf = open("C:/myfolder"+str(i)+".pdf", 'wb')
            pdf.write(response.content)
            pdf.close()
            print("File ", i, " downloaded")
It starts running but it does not download any file. I would like to keep the original name of the pdf (for example: EL103_L_1978_03_024_01_1_PF_03.pdf). Any suggestion?
Solution
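The URLs in your column already point directly at the PDF files, so there is most likely no HTML page with <a> links for BeautifulSoup to parse, which would explain why nothing gets downloaded. You can request each URL directly and take the file name from the last part of the URL: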
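import requests

for pdf_url in df["pdf"]:
    # the last path segment of the URL is the original file name
    file_name = pdf_url.split("/")[-1]
    with open(file_name, "wb") as f_out:
        print("Downloading", pdf_url)
        # fetch the PDF and write its raw bytes to disk
        f_out.write(requests.get(pdf_url).content)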
Prints:
Downloading https://ia802902.us.archive.org/10/items/EL103_L_1978_03_024_01_1_PF_03/EL103_L_1978_03_024_01_1_PF_03.pdf
Downloading https://ia801900.us.archive.org/31/items/EL103_L_1978_03_033_07_1_PF_05/EL103_L_1978_03_033_07_1_PF_05.pdf
Downloading https://ia802900.us.archive.org/35/items/EL105_L_1978_03_072_03_1_PF_05/EL105_L_1978_03_072_03_1_PF_05.pdf
and saves them as:
andrej@MyPC:~/app$ ls -alF *pdf
-rw-r--r-- 1 root root 792942 sep 10 22:54 EL103_L_1978_03_024_01_1_PF_03.pdf
-rw-r--r-- 1 root root 559170 sep 10 22:54 EL103_L_1978_03_033_07_1_PF_05.pdf
-rw-r--r-- 1 root root 935443 sep 10 22:54 EL105_L_1978_03_072_03_1_PF_05.pdf
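If you want the files in a specific folder (the question used C:/myfolder) and want failed requests to raise instead of silently writing an error page to disk, a slightly more defensive variant could look like the sketch below. The helper names filename_from_url and download_pdf, the 30-second timeout, and the 8192-byte chunk size are my own choices for illustration, not part of the original answer. Note that building the path with os.path.join also avoids the missing separator in "C:/myfolder"+str(i)+".pdf" from the question.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


def download_pdf(url, out_dir, timeout=30):
    """Download one URL into out_dir under its original file name (hypothetical helper)."""
    # requests is third-party; it is imported here so that
    # filename_from_url can be used even where requests is not installed
    import requests

    os.makedirs(out_dir, exist_ok=True)
    target = os.path.join(out_dir, filename_from_url(url))
    with requests.get(url, stream=True, timeout=timeout) as response:
        # raise an exception on 4xx/5xx instead of saving an error page
        response.raise_for_status()
        with open(target, "wb") as f_out:
            # stream in chunks so large PDFs are not held fully in memory
            for chunk in response.iter_content(chunk_size=8192):
                f_out.write(chunk)
    return target
```

With the dataframe from the question you would call it as download_pdf(pdf_url, "C:/myfolder") inside the same for pdf_url in df["pdf"] loop; it returns the path of the saved file.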
Answered By - Andrej Kesely