Issue
I want to crawl a page and download all the images I need, but the source code I wrote doesn't work. I am not very experienced with Python, so I would appreciate your help.
import requests
from bs4 import BeautifulSoup as bs
from bs4 import BeautifulSoup
import pyautogui
from selenium import webdriver
import os
import subprocess
import urllib.request
import time
res = requests.get("https://parkmu123.neocities.org/question")
html = res.text
soup = BeautifulSoup(html, "html.parser")
search_string1 = "parkmu123.neocities.org/"
search_string2 = ".jpg"
search_string3 = ".gif"
img_dest = "c:/users/root/desktop/img/"
for i in soup:
    if (soup.startswitch(search_string1) and soup.endswitch(search_string2) or soup.startswitch(search_string1) and soup.endswitch(search_string3) ):
        urllib.request.urlretrieve(i, img_dest+str(i+1)+".jpg or .gif")
    else:
        continue
I have succeeded in crawling web pages using BeautifulSoup, but I am at a loss as to how to save all three image links on the homepage.
Solution
This expression
urllib.request.urlretrieve(i, img_dest+str(i+1)+".jpg or .gif")
creates a file whose name literally ends with ".jpg or .gif". It's also not clear what you hope str(i+1) would do here; i is not a number.
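If the goal was numbered files that keep their real extension, one way is to take the extension from the URL itself and the number from enumerate. This is only a sketch: the helper name numbered_name and the two example URLs are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def numbered_name(url, index):
    """Return a name like '1.gif', keeping the extension found in the URL."""
    path = urlsplit(url).path            # drops any ?query or #fragment part
    extension = posixpath.splitext(path)[1]
    return f"{index}{extension}"


# Hypothetical URLs, only here to show the behaviour.
urls = [
    "https://parkmu123.neocities.org/pics/cat.gif",
    "https://parkmu123.neocities.org/dog.jpg?size=big",
]
for number, url in enumerate(urls, start=1):
    print(numbered_name(url, number))
```

With the two made-up URLs above, this prints 1.gif and 2.jpg.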
Furthermore, in
for i in soup:
    if (soup.startswitch(search_string1) and soup.endswitch(search_string2) or soup.startswitch(search_string1) and soup.endswitch(search_string3) ):
you are repeatedly examining soup and ignoring i. Moreover, for i in soup will just loop over the top-level objects in the soup object (so basically the DOCTYPE declaration and the actual HTML tree).
Also, findall is not a valid method (the BeautifulSoup method is find_all); where did you get that from? And you mistyped startswith and endswith (see what they mean? "Starts with" and "ends with"?)
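As an aside, both str.startswith and str.endswith also accept a tuple of strings, so the two extension checks can collapse into one call. A minimal sketch, assuming the same prefix and extensions as in the question; the function name is_wanted_image is hypothetical.

```python
PREFIX = "https://parkmu123.neocities.org/"
EXTENSIONS = (".jpg", ".gif")  # a tuple: endswith matches if any entry matches


def is_wanted_image(url):
    """True if url is on the expected site and ends in a wanted extension."""
    return url.startswith(PREFIX) and url.endswith(EXTENSIONS)


# The misspelled name from the question does not exist on str at all.
print(hasattr(str, "startswitch"))
print(hasattr(str, "startswith"))
```

The check is case-sensitive, so a name ending in ".JPG" would be rejected unless you lowercase the URL first.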
Here is a version which extracts actual images, and removes a ton of unused imports.
import os
import urllib.request
import requests
from bs4 import BeautifulSoup
res = requests.get("https://parkmu123.neocities.org/question")
soup = BeautifulSoup(res.text, "html.parser")
search_string1 = "https://parkmu123.neocities.org/"  # notice https://
search_string2 = ".jpg"
search_string3 = ".gif"
img_dest = "c:/users/root/desktop/img/"
os.makedirs(img_dest, exist_ok=True)  # urlretrieve will not create the folder for you
for i in soup.find_all("img"):
    img = i.get("src")
    if not img:
        continue  # skip <img> tags that have no src attribute
    if img.startswith(search_string1) and (img.endswith(search_string2) or img.endswith(search_string3)):
        # save under the last path component of the URL
        urllib.request.urlretrieve(img, os.path.join(img_dest, img.split("/")[-1]))
However, for the page you ask about, it does nothing; there are no valid image links, just garbage.
Also, mixing requests and urllib in the same script is slightly weird, but I wanted to focus on the obvious errors.
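If you would rather not mix the two, one option is to stay entirely within the standard library's urllib. The sketch below is just that, a sketch: the function names fetch_text and download are hypothetical, the 30-second timeout is an arbitrary choice, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import os
import shutil
import urllib.request


def fetch_text(url, timeout=30):
    """Return the body at url as text (UTF-8 assumed if no charset is given)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def download(url, dest_dir, timeout=30):
    """Save the resource at url into dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, url.split("/")[-1])
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)  # stream to disk in chunks
    return target
```

Here fetch_text(...) would stand in for requests.get(...).text, and download(img, img_dest) for the urlretrieve call. Going the other way and using requests for both steps would be just as reasonable.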
Answered By - tripleee