Sunday, December 26, 2021

[FIXED] Get image data-src with Beautiful Soup when there is no image extension

beautifulsoup, python, web-scraping

Issue

I am trying to get the image URLs for all the books on this page, https://www.nb.co.za/en/books/0-6-years, with Beautiful Soup.

This is my code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get(f'https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
        tag.parent.name == 'div' and
        'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

        cover = soup.find_all('div', class_="img-container")
        print(cover)

And this is my output:

<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>

What I hope to get:

https://www.nb.co.za/en/helper/ReadImage/25929.jpg

My questions are:

How do I get only the data-src value?
How do I get the extension of the image?

Solution

1: How do I get only the data-src value?

You can read the data-src attribute of an element with element['data-src']:

cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
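
As a minimal sketch of the attribute access on its own, the snippet below parses the HTML fragment from the question and reads data-src from the img tag. It uses the built-in html.parser instead of lxml only to avoid an extra dependency; the attribute name "missing" is made up for illustration.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the output shown in the question
html = """
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# Subscript access raises KeyError when the attribute is absent;
# .get() returns None instead, which is safer on inconsistent pages.
print(img["data-src"])
print(img.get("data-src"))
print(img.get("missing"))  # "missing" is a made-up attribute name
```

Using .get() is worth considering when some img tags on the page may not carry a data-src attribute at all.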

2: How do I get the extension of the image?

You can determine the file extension as diggusbickus mentioned (a good approach), but that will not help if you then try to request the file with the extension appended. A request to https://www.nb.co.za/en/helper/ReadImage/25929.jpg returns a 404 error.

The image is loaded and served dynamically. For additional information, see https://stackoverflow.com/a/5110673/14460824
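
If you need an extension for saving the file locally, one option is to derive it from the Content-Type header of the image response. The sketch below assumes the server sends such a header, which has not been verified for this site; the helper name extension_from_content_type and the EXTENSIONS mapping are made up for illustration.

```python
# Hypothetical mapping; extend it for other image types as needed.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def extension_from_content_type(content_type):
    """Return a file extension for a Content-Type header value, or None."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=binary" and normalise the case
    mime = content_type.split(";")[0].strip().lower()
    return EXTENSIONS.get(mime)


print(extension_from_content_type("image/jpeg"))
print(extension_from_content_type("Image/PNG; charset=binary"))
print(extension_from_content_type("text/html"))

# With requests, the header would come from the response, for example:
#   r = requests.get(cover_url)
#   ext = extension_from_content_type(r.headers.get("Content-Type"))
```

The extension is then only used for the local file name. The URL you request stays as it is, without any extension.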

Example

from bs4 import BeautifulSoup
import requests

# No trailing slash: the href and data-src values already start with "/"
baseurl = "https://www.nb.co.za"
nocover = '/Content/images/no-cover.jpg'
data = []

r = requests.get(baseurl + '/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')

for item in soup.select('.book-slider-frame'):
    # Use the lazy-loaded data-src unless src is already the no-cover image
    data.append({
        'link': baseurl + item.a['href'],
        'cover': baseurl + item.img['data-src'] if item.img['src'] != nocover else baseurl + nocover
    })

data

Output

[{'link': 'https://www.nb.co.za/en/view-book/?id=9780798182539',
  'cover': 'https://www.nb.co.za/en/helper/ReadImage/25929'},
 {'link': 'https://www.nb.co.za/en/view-book/?id=9780798182546',
  'cover': 'https://www.nb.co.za/en/helper/ReadImage/25931'},
 {'link': 'https://www.nb.co.za/en/view-book/?id=9780798182553',
  'cover': 'https://www.nb.co.za/en/helper/ReadImage/25925'},...]
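
As an alternative to concatenating strings, the standard library's urllib.parse.urljoin can combine the base URL with a site-relative path, so you do not have to keep track of leading and trailing slashes yourself. This is a minimal sketch that reuses the paths from the question.

```python
from urllib.parse import urljoin

baseurl = "https://www.nb.co.za/"

# A path starting with "/" replaces the path part of the base URL,
# so the result contains exactly one slash after the host.
cover = urljoin(baseurl, "/en/helper/ReadImage/25929")
link = urljoin(baseurl, "/en/view-book/?id=9780798182539")

print(cover)  # https://www.nb.co.za/en/helper/ReadImage/25929
print(link)   # https://www.nb.co.za/en/view-book/?id=9780798182539
```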

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
