Issue
I am trying to get the image URLs for all the books on this page, https://www.nb.co.za/en/books/0-6-years, with Beautiful Soup.
This is my code:
from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
            tag.parent.name == 'div' and
            'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

cover = soup.find_all('div', class_="img-container")
print(cover)
And this is my output:
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
What I hope to get:
https://www.nb.co.za/en/helper/ReadImage/25929.jpg
My problem is:
How do I get only the data-src value?
How do I get the extension of the image?
Solution
1: How do I get only the data-src value?
You can read the data-src attribute with element['data-src'] (nocover is the path of the placeholder image, defined in the example below):
cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
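To see the attribute access in isolation, here is a minimal sketch that parses the HTML fragment from the question instead of the live page. It uses the built-in "html.parser" rather than lxml only to avoid an extra dependency, and the attribute name "no-such-attr" is made up for illustration.

```python
from bs4 import BeautifulSoup

# The fragment printed in the question, used here instead of the live page.
html = """
<div class="img-container">
  <a href="/en/view-book/?id=9780798182539">
    <img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# Subscript access raises KeyError if the attribute is missing.
print(img["data-src"])          # /en/helper/ReadImage/25929

# .get() returns None instead, which is safer for images without data-src.
print(img.get("no-such-attr"))  # None
```

If some images on the page have no data-src at all, .get() lets you fall back to another value without wrapping the lookup in try/except.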
2: How do I get the extension of the image?
You can get the file extension the way diggusbickus mentioned (a good approach), but that will not help if you then request the file with the extension appended: https://www.nb.co.za/en/helper/ReadImage/25929.jpg returns a 404 error.
The image is loaded and served dynamically. For additional information see https://stackoverflow.com/a/5110673/14460824
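If you still need an extension, for example to name a downloaded file, one option is to derive it from the Content-Type header of the image response rather than from the URL. The sketch below is not taken from the linked answer, and whether this particular server sends a useful Content-Type is an assumption I have not verified. The names KNOWN_IMAGE_TYPES, extension_from_content_type and fetch_extension are made up for illustration, and it uses urllib from the standard library instead of requests.

```python
import mimetypes
from urllib.request import urlopen

# Explicit map for common image types so the result does not depend on the
# platform's MIME database; anything else falls back to mimetypes.
KNOWN_IMAGE_TYPES = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}

def extension_from_content_type(content_type):
    """Return a file extension for a Content-Type header value, or None."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=binary" and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return KNOWN_IMAGE_TYPES.get(mime) or mimetypes.guess_extension(mime)

def fetch_extension(url):
    """Request the URL and derive the extension from the response headers."""
    with urlopen(url) as response:
        return extension_from_content_type(response.headers.get("Content-Type"))

print(extension_from_content_type("image/jpeg"))  # .jpg
```

You would call fetch_extension(cover) with one of the cover URLs and append the result to the local filename only, not to the URL you request.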
Example
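from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'  # placeholder used for books without a cover

r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')

data = []
for item in soup.select('.book-slider-frame'):
    data.append({
        'link' : baseurl+item.a['href'],
        # lazy-loaded covers keep the real path in data-src
        'cover' : baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
    })
data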
Output
[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
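Note the double slash after the host in the URLs above: baseurl ends with "/" and the scraped paths start with "/". If you want clean URLs, urljoin from the standard library is one way to handle that. This is a small sketch, and the helper name absolute is made up for illustration.

```python
from urllib.parse import urljoin

baseurl = "https://www.nb.co.za/"

def absolute(path):
    """Resolve a site-relative path against baseurl without doubling slashes."""
    return urljoin(baseurl, path)

print(absolute("/en/view-book/?id=9780798182539"))
# https://www.nb.co.za/en/view-book/?id=9780798182539
print(absolute("/en/helper/ReadImage/25929"))
# https://www.nb.co.za/en/helper/ReadImage/25929
```

In the loop you would then write absolute(item.a['href']) in place of baseurl+item.a['href'], and the same for the cover path.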
Answered By - HedgeHog