Issue
I'm trying to extract the first image URL from a Slideshare presentation, so that I can then iterate through the page numbers and scrape the whole slideshow. In the image URL shown below, the -1- before 2048 is the page number, so once I have the first URL I can split the string and substitute the number to iterate through the pages.
Retrieving the image URL is proving problematic.
Here is my code:
import requests
from bs4 import BeautifulSoup
a = requests.get("https://www.slideshare.net/JSYashas/netflix-73262280")
soup = BeautifulSoup(a.content, 'lxml')
soup2 = soup.find_all()
and this is the image URL I'm trying to extract:
https://image.slidesharecdn.com/netflix-170317184749/75/netflix-1-2048.jpg?cb=1665800047
(Both are just examples I pulled from the internet, not the actual files I'm trying to work with.)
What I can't wrap my head around is what to use in find_all() in order to return this image URL.
Ideally, my intended solution was to look for the first occurrence of "-1-2048.jpg" and then use that to pull the full string, but I couldn't get this to work.
I like this approach because it's robust to different file paths and html structures, which I suspect are not uniform across Slideshare.
Any help is greatly appreciated.
Solution
Rather than using find_all, I suggest searching for the picture element directly.
Try this:
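import requests
from bs4 import BeautifulSoup

a = requests.get("https://www.slideshare.net/JSYashas/netflix-73262280")
soup = BeautifulSoup(a.content, 'lxml')
# The slide image sits in a <picture> element; its <source> lists several sizes in srcset
pic = soup.find('picture', attrs={'data-testid':'slide-image-picture'}).find("source")["srcset"]
# In this srcset the space-separated token at index 4 is the URL of the 2048px image
link = pic.split(" ")[4]
print(link)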
Output
https://image.slidesharecdn.com/netflix-170317184749/75/netflix-1-2048.jpg?cb=1665800047
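Picking the token at index 4 depends on the srcset listing exactly three sizes in a fixed order. As a hedged alternative, here is a minimal sketch that parses the srcset and returns the candidate with the largest width descriptor. The helper name largest_srcset_url and the sample srcset string are assumptions made for illustration (the sample is modelled on the URL above, not copied from the live page), and the sketch assumes candidates are separated by a comma followed by whitespace.

```python
import re


def largest_srcset_url(srcset):
    """Return the URL with the largest 'NNNw' width descriptor, or None."""
    best_url, best_width = None, -1
    # Candidates look like "URL 320w"; assume ", " separates them.
    for candidate in re.split(r",\s+", srcset.strip()):
        parts = candidate.split()
        if not parts:
            continue
        url, width = parts[0], 0
        # A missing or non-width descriptor is treated as width 0.
        if len(parts) > 1 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        if width > best_width:
            best_url, best_width = url, width
    return best_url


# Hypothetical srcset value, modelled on the image URL in the question.
sample_srcset = (
    "https://image.slidesharecdn.com/netflix-170317184749/85/netflix-1-320.jpg?cb=1665800047 320w, "
    "https://image.slidesharecdn.com/netflix-170317184749/85/netflix-1-638.jpg?cb=1665800047 638w, "
    "https://image.slidesharecdn.com/netflix-170317184749/75/netflix-1-2048.jpg?cb=1665800047 2048w"
)
print(largest_srcset_url(sample_srcset))
```

With a real page, the srcset string would come from the same soup.find('picture', ...).find("source")["srcset"] lookup as in the snippet above.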
I also tried the picture/srcset lookup on other slide decks, and it retrieved the high-resolution image there too.
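The approach described in the question, searching the raw HTML for the first occurrence of "-1-2048.jpg" and then swapping the page number, can be sketched roughly as follows. This does not depend on the page's tag structure, but the regular expression is an assumption based on the single example URL, and the function names, the sample HTML, and the page count are hypothetical. It also assumes that every page uses the same path and query string as the first one, which I have not verified.

```python
import re

# Assumed pattern: a slidesharecdn image URL whose file name ends in
# "-1-2048.jpg", optionally followed by a query string such as "?cb=...".
FIRST_SLIDE = re.compile(
    r"https://image\.slidesharecdn\.com/[^\s\"'<>]*-1-2048\.jpg(?:\?[^\s\"'<>,]*)?"
)


def first_slide_url(html):
    """Return the first matching image URL in the HTML text, or None."""
    match = FIRST_SLIDE.search(html)
    return match.group(0) if match else None


def slide_urls(first_url, page_count):
    """Build one URL per page by replacing the page number in '-1-2048.jpg'."""
    return [
        first_url.replace("-1-2048.jpg", f"-{n}-2048.jpg", 1)
        for n in range(1, page_count + 1)
    ]


# Hypothetical HTML fragment; with a real page, pass requests.get(...).text.
sample_html = (
    '<picture><source srcset="'
    "https://image.slidesharecdn.com/netflix-170317184749/85/netflix-1-320.jpg?cb=1665800047 320w, "
    "https://image.slidesharecdn.com/netflix-170317184749/75/netflix-1-2048.jpg?cb=1665800047 2048w"
    '"></picture>'
)

first = first_slide_url(sample_html)
for url in slide_urls(first, 3):  # 3 is a made-up page count
    print(url)
```

The page count has to come from somewhere else, for example by requesting successive URLs until one fails.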
Answered By - petezurich