Sunday, April 10, 2022

[FIXED] BeautifulSoup: Missing Schema invalid url error


Issue

I am trying to download images from a web page using BeautifulSoup, but I am getting the following error:

MissingSchema: Invalid URL

import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen  # needed for the urlopen() call below
import os
from os.path import basename



url = "https://xxxxxx"

#r = requests.get(url)

request_page = urlopen(url)
page_html = request_page.read()
request_page.close()
soup = BeautifulSoup(page_html, 'html.parser')

#print(soup.title.text)
images = soup.find_all('img')
for image in images:
    name = image['alt']
    link =image['src']
    with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
        im = requests.get(link)
        f.write(im.content)
    

print(images)

I am unsure why. I know the images are being found, because the print worked fine until I added the following code:

with open(name.replace(' ', '-').replace('/', '') + 'jpg', 'wb') as f:
    im = requests.get(link)
    f.write(im.content)

I would be grateful for any help, thanks.

EDIT

The url is

url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"

I added print(link) as requested, and the output is below:

//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/c5/Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg/280px-Titian_-_Portrait_of_a_man_with_a_quilted_sleeve.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg

EDIT

I am just wondering if it is the length of the name in the link? Looking at it, the file seems to be buried in a lot of folders before we get to the JPEG.

Solution

As I suspected from the error, the output of that print statement shows that the links you are trying to access are not valid URLs.

//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg/300px-Portrait_of_Tsaritsa_Natalya_Kirillovna_Naryshkina_-_Google_Cultural_Institute.jpg needs to start with https:.
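A quick way to confirm this without making any request is to parse one of the printed links with the standard library. This is only a sketch; the helper name has_scheme is made up for illustration and is not part of requests or BeautifulSoup.

```python
from urllib.parse import urlsplit


def has_scheme(link):
    """Return True if the link carries a scheme such as 'https'."""
    # Hypothetical helper for illustration only.
    return bool(urlsplit(link).scheme)


# One of the links printed in the question.
link = ("//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/"
        "Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg")

parts = urlsplit(link)
print(repr(parts.scheme))            # empty: there is no scheme
print(parts.netloc)                  # the host is still recognised
print(has_scheme(link))              # False
print(has_scheme("https:" + link))   # True
```

The host is parsed correctly, so the only thing missing is the scheme in front of it.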

To fix this, simply prepend 'https:' to image['src'].
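Prepending a fixed string works for this page, but it would produce a broken link if a src were already absolute or were a plain relative path. As an alternative sketch (not what the answer's code does), urllib.parse.urljoin resolves each src against the page URL. The names PAGE_URL and absolute_url, and the '/some/path/logo.png' and example.org inputs, are assumptions made up for illustration.

```python
from urllib.parse import urljoin

# The page the <img> tags were scraped from (URL taken from the question).
PAGE_URL = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2018"


def absolute_url(src, page_url=PAGE_URL):
    """Resolve an <img> src against the URL of the page it came from."""
    # Hypothetical helper: urljoin fills in whatever parts the src lacks.
    return urljoin(page_url, src)


# Protocol-relative src, as printed in the question: gains the page's scheme.
print(absolute_url("//upload.wikimedia.org/wikipedia/commons/thumb/f/f7/"
                   "Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg"))

# Root-relative src (made-up path): gains scheme and host.
print(absolute_url("/some/path/logo.png"))

# Already absolute src (made-up URL): returned unchanged.
print(absolute_url("https://example.org/picture.png"))
```

In the loop, this would replace the line that builds link, for example link = absolute_url(image['src']).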

The second issue is the file name: you are writing the file as something like 'Natalya-Naryshkinajpg', with no dot before the extension. It needs to end in a proper file extension, for example 'Natalya-Naryshkina.jpg'. I fixed that as well.
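One way to build such a name is to take the extension from the link itself rather than hard-coding 'jpg'. The sketch below uses only the standard library; the function name image_filename and the sample alt text are assumptions for illustration, and the clean-up of the alt text mirrors the replace() calls from the question.

```python
import os
from urllib.parse import unquote, urlsplit


def image_filename(alt_text, link):
    """Build a file name from the alt text plus the link's own extension."""
    # Hypothetical helper. Only the path is inspected, so a query string
    # such as '?width=300' cannot be mistaken for part of the extension.
    path = unquote(urlsplit(link).path)
    extension = os.path.splitext(path)[1]  # includes the leading dot
    stem = alt_text.replace(' ', '-').replace('/', '')
    return stem + extension


link = ("https://upload.wikimedia.org/wikipedia/commons/thumb/f/f7/"
        "Bee_on_Lavender_Blossom_2.jpg/250px-Bee_on_Lavender_Blossom_2.jpg")

# The alt text here is a made-up example, not the real attribute value.
print(image_filename("Bee on Lavender Blossom", link))
```

Because os.path.splitext keeps the dot, there is no need to add '.' by hand, and a link without any extension simply yields the bare stem.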

Code:

import requests
from bs4 import BeautifulSoup


url = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/September_2019"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

r = requests.get(url, headers=headers)
page_html = r.text
soup = BeautifulSoup(page_html, 'html.parser')

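# print(soup.title.text)
images = soup.find_all('img')
for image in images:
    name = image['alt']
    # The src values are protocol-relative ('//upload.wikimedia.org/...'),
    # so prepend the scheme to turn them into valid URLs for requests.
    link = 'https:' + image['src']
    # print(link)
    # Skip any link containing 'static' (presumably the site's own interface images).
    if 'static' not in link:
        try:
            # Reuse the link's own extension so the file is saved as e.g. 'name.jpg'.
            extension = link.split('.')[-1]
            with open(name.replace(' ', '-').replace('/', '') + '.' + extension, 'wb') as f:
                im = requests.get(link, headers=headers)
                f.write(im.content)
                print(name)
        except Exception as e:
            # Report the failure and carry on with the next image.
            print(e)

print(images)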

Answered By - chitown88

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
