Thursday, January 18, 2024

[FIXED] I want to download all images starting with "parkmu123.neocities.org/" and ending with ".jpg" or ".gif" from the crawled page

Tags: python, python-3.x

Issue

I want to crawl a page and download every image that matches, but the source code I wrote doesn't work. I am not very experienced with Python, so I would appreciate your help.

import requests
from bs4 import BeautifulSoup as bs
from bs4 import BeautifulSoup
import pyautogui
from selenium import webdriver
import os
import subprocess
import urllib.request
import time

res = requests.get("https://parkmu123.neocities.org/question")
html = res.text
soup = BeautifulSoup(html, "html.parser")

search_string1 = "parkmu123.neocities.org/"
search_string2 = ".jpg"
search_string3 = ".gif"
img_dest = "c:/users/root/desktop/img/"

for i in soup:
    if (soup.startswitch(search_string1) and soup.endswitch(search_string2) or soup.startswitch(search_string1) and soup.endswitch(search_string3) ):
        urllib.request.urlretrieve(i, img_dest+str(i+1)+".jpg or .gif")
    else:
        continue

I have succeeded in crawling web pages using BeautifulSoup, but I am at a loss as to how to save all three images linked on the homepage.

Solution

This expression

        urllib.request.urlretrieve(i, img_dest+str(i+1)+".jpg or .gif")

creates a file whose name literally ends with ".jpg or .gif". It's also not clear what you hope str(i+1) would do here; i is an element of the soup, not a number, so i+1 cannot produce a counter.
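One way to sidestep both problems is to take the file name, extension included, from the URL itself. A minimal sketch using only the standard library; the helper name filename_from_url and the example.com URLs are made up for illustration:

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, e.g. 'cat.gif'."""
    # urlsplit() separates the path from any ?query or #fragment,
    # so those never leak into the file name.
    path = urlsplit(url).path
    return posixpath.basename(path)


print(filename_from_url("https://example.com/img/cat.gif"))
print(filename_from_url("https://example.com/img/dog.jpg?size=large"))
```

The saved file then keeps whatever extension the image actually has, so there is no need to guess between ".jpg" and ".gif".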

Furthermore, in

for i in soup:
    if (soup.startswitch(search_string1) and soup.endswitch(search_string2) or soup.startswitch(search_string1) and soup.endswitch(search_string3) ):

you are repeatedly examining soup, and ignoring i. Moreover, for i in soup only loops over the top-level objects in the soup object (so basically the DOCTYPE declaration and the root of the actual HTML tree); it never reaches the individual tags nested inside.
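The difference is easy to see on a tiny document. This is only a sketch: the HTML string below is invented for the demonstration and is not the page from the question.

```python
from bs4 import BeautifulSoup

# Made-up markup with one image directly in <body> and one nested in a <p>.
html = "<!DOCTYPE html><html><body><img src='a.jpg'><p><img src='b.gif'></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Iterating over the soup yields only its direct children.
top_level = [type(node).__name__ for node in soup]

# find_all() searches the whole tree, however deeply the tags are nested.
sources = [tag.get("src") for tag in soup.find_all("img")]

print(top_level)
print(sources)
```

The first list holds only the handful of top-level nodes, while the second contains the src of every img tag.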

Also, findall is not a valid BeautifulSoup method; the one you want is find_all. And you mistyped startswith and endswith (see what they mean? "Starts with" and "ends with"?)
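For reference, both are plain string methods, and endswith also accepts a tuple of suffixes, which keeps the ".jpg or .gif" test short. A small sketch; the helper name is_wanted and the sample file names are hypothetical:

```python
PREFIX = "https://parkmu123.neocities.org/"
EXTENSIONS = (".jpg", ".gif")


def is_wanted(url):
    """True if url starts with PREFIX and ends with one of EXTENSIONS."""
    # str.endswith() returns True if any suffix in the tuple matches.
    return url.startswith(PREFIX) and url.endswith(EXTENSIONS)


print(is_wanted("https://parkmu123.neocities.org/a.jpg"))
print(is_wanted("https://parkmu123.neocities.org/page.html"))
print(is_wanted("https://elsewhere.example/a.gif"))
```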

Here is a version which extracts actual images, and removes a ton of unused imports.

import os
import urllib.request

import requests
from bs4 import BeautifulSoup

res = requests.get("https://parkmu123.neocities.org/question")
soup = BeautifulSoup(res.text, "html.parser")

search_string1 = "https://parkmu123.neocities.org/"  # notice https://
search_string2 = ".jpg"
search_string3 = ".gif"
img_dest = "c:/users/root/desktop/img/"

for i in soup.find_all("img"):
    img = i.get("src")
    # get() returns None for an <img> without a src attribute, so check that first
    if img and img.startswith(search_string1) and (img.endswith(search_string2) or img.endswith(search_string3)):
        urllib.request.urlretrieve(img, os.path.join(img_dest, img.split("/")[-1]))

However, for the page you ask about, it does nothing; there are no valid image links, just garbage.
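In general, one common reason a prefix check like this matches nothing is that a page uses relative src values such as "pic.jpg" rather than full URLs. Whether that applies to this particular page is an assumption, not something established here. If it does, urllib.parse.urljoin can resolve each src against the page URL before the check; the helper name absolute_src and the sample paths are made up:

```python
from urllib.parse import urljoin

PAGE_URL = "https://parkmu123.neocities.org/question"


def absolute_src(src, page_url=PAGE_URL):
    """Resolve a possibly relative src against the URL of the page it came from."""
    # Absolute URLs pass through unchanged; relative ones are resolved
    # the same way a browser would resolve them.
    return urljoin(page_url, src)


print(absolute_src("pic.jpg"))
print(absolute_src("/img/pic.gif"))
print(absolute_src("https://other.example/x.jpg"))
```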

Also, mixing requests and urllib in the same script is slightly weird, but I wanted to focus on the obvious errors.
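If you would rather stay with one HTTP library, the download step could go through requests as well. The following is a rough sketch, not a drop-in replacement: download_image is a hypothetical helper, the 10 second timeout is an arbitrary choice, and the session argument is assumed to be a requests.Session (anything with a compatible get() method works).

```python
import os
import posixpath
from urllib.parse import urlsplit


def download_image(session, url, dest_dir):
    """Fetch url with session.get() and save it under dest_dir; return the path."""
    os.makedirs(dest_dir, exist_ok=True)  # create the target folder if needed
    response = session.get(url, timeout=10)
    response.raise_for_status()  # raise on 404 and similar instead of saving an error page
    name = posixpath.basename(urlsplit(url).path)
    path = os.path.join(dest_dir, name)
    with open(path, "wb") as handle:
        handle.write(response.content)
    return path
```

Inside the loop this would be called as download_image(requests.Session(), img, img_dest), ideally creating the session once before the loop and reusing it.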

Answered By - tripleee

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
