Issue
I am trying to download images from a web page using BeautifulSoup, but I am getting the following error:
MissingSchema: Invalid URL
import requests
from bs4 import BeautifulSoup
import os
from os.path import basename
from urllib.request import urlopen  # needed for urlopen() below

url = "https://xxxxxx"
#r = requests.get(url)
request_page = urlopen(url)
page_html = request_page.read()
request_page.close()
soup = BeautifulSoup(page_html, 'html.parser')
#print(soup.title.text)
images = soup.find_all('img')
for image in images:
    name = image['alt']
    link = image['src']
    with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
        im = requests.get(link)  # MissingSchema is raised here
        f.write(im.content)
print(images)
I am unsure why. I know I can read the images, because the print worked fine until I added the following code:
    with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
        im = requests.get(link)
        f.write(im.content)
I would be grateful for any help thanks
EDIT
The url is
url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"
I added the print link as requested and the output is below
//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/c5/Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg/280px-Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg
edit
I am just wondering whether the problem is the length of the name in the link. Looking at it, the file seems to be buried in a lot of folders before we get to the JPEG.
Solution
As I suspected from the error, the output of that print statement shows that the links you are trying to access are not valid URLs. They are protocol-relative: they start with // and have no scheme, which is why requests raises MissingSchema.
//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg
needs to start with https:. To fix this, simply prepend that to image['src'].
The second issue you need to fix is that when you write the file, you are writing it as 'Natalya-Naryshkinajpg'. You need a dot before jpg so that it becomes a proper file extension, for example 'Natalya-Naryshkina.jpg'. I fixed that as well.
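As a side note, prepending 'https:' by hand works when every src is protocol-relative, as in the output above. A minimal sketch of a more general approach, assuming you would rather let the standard library resolve each src against the page URL, is shown below. The helper name absolute_url and the '/static/...' path are made up for illustration and are not part of the original code.

```python
from urllib.parse import urljoin

# Hypothetical helper (not in the original answer): resolve an <img> src
# against the URL of the page it was scraped from.
def absolute_url(page_url, src):
    # urljoin fills in whatever the src is missing (scheme, host) from the
    # page URL and leaves already-absolute URLs unchanged.
    return urljoin(page_url, src)

page = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"

# Protocol-relative src taken from the question's output: gains the page's scheme.
print(absolute_url(page, "//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg"))

# Illustrative host-relative path (an assumption, not from the page): gains scheme and host.
print(absolute_url(page, "/static/example.png"))
```

With this, link = absolute_url(url, image['src']) could stand in for the string concatenation, and it would also cope with a src that already starts with https://.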
Code:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2019"
# Send a browser-like User-Agent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
r = requests.get(url, headers=headers)
page_html = r.text
soup = BeautifulSoup(page_html, 'html.parser')
#print(soup.title.text)
images = soup.find_all('img')
for image in images:
    name = image['alt']
    # The src values are protocol-relative (//upload...), so add the scheme
    link = 'https:' + image['src']
    #print(link)
    # Skip links that contain 'static'
    if 'static' not in link:
        try:
            # Take the extension from the link instead of hard-coding 'jpg'
            extension = link.split('.')[-1]
            with open(name.replace(' ', '-').replace('/', '') + '.' + extension, 'wb') as f:
                im = requests.get(link, headers=headers)
                f.write(im.content)
                print(name)
        except Exception as e:
            print(e)
print(images)
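One possible refinement, offered as a sketch rather than as part of the answer: link.split('.')[-1] assumes the link ends in the extension, which may not hold if a URL carries a query string or has no extension at all. The hypothetical helper below (the name image_filename and the default_ext fallback are assumptions) derives the extension from the URL path only and keeps the same space and slash replacements as the code above.

```python
import os
from urllib.parse import urlparse

# Hypothetical helper (not in the original answer): build a file name from
# the image's alt text and the extension found in the URL path.
def image_filename(alt_text, link, default_ext=".jpg"):
    # urlparse(...).path drops any query string or fragment, and
    # os.path.splitext returns the extension with its leading dot.
    ext = os.path.splitext(urlparse(link).path)[1] or default_ext
    # Same clean-up as the answer's code; fall back to "image" for empty alt text.
    stem = alt_text.replace(" ", "-").replace("/", "") or "image"
    return stem + ext

print(image_filename(
    "Natalya Naryshkina",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg",
))
```

Inside the loop, open(image_filename(name, link), 'wb') could then replace the inline string building.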
Answered By - chitown88