Issue
Problem: I am trying to scrape the image source locations for pictures on a website, but I cannot get Beautiful Soup to scrape them successfully.
Details:
The three images I want have the following HTML tags:
<img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg" style="display: none;">
<img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg" style="display: none;">
<img src="https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg" style="display: none;">
Code I've Tried:
soup.find_all('img')
soup.select('#imageFlicker')
soup.select('#imageFlicker > div')
soup.select('#imageFlicker > div > img:nth-child(1)')
soup.find_all('div', {'class':'exercise-post__step-image-wrap'})
soup.find_all('div', attrs={'id': 'imageFlicker'})
soup.select_all('#imageFlicker > div > img:nth-child(1)')
The very first query, soup.find_all('img'), gets every image on the page except the three images I want. I've tried looking at the children and sub-children of each of the above, and none of that works either.
What am I missing here? I think there may be JavaScript that is changing the CSS display attribute from block to none and back so the three images look like a GIF instead of three different images. Is that messing things up in a way I'm not understanding? Thank you!
Solution
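The content is provided dynamically via JavaScript. Unlike a browser, requests only downloads the raw HTML and does not execute scripts, so the three img tags never appear in the response that Beautiful Soup parses.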
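To illustrate the point, here is a minimal sketch that counts tags in a piece of raw HTML using only the standard library. The RAW_HTML string, its placeholder URL, and the TagCounter class are hypothetical stand-ins, assuming the server sends an empty container plus a script rather than the img tags themselves.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the raw HTML returned by the server.
# The real page is larger; the URL below is a placeholder.
RAW_HTML = """
<div id="imageFlicker"></div>
<script>var data = {"images": ["https://example.com/a.jpg"], "interval": 600};</script>
"""


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in the fed markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(RAW_HTML)

# The image URLs exist only as text inside the script, not as img elements.
img_count = counter.counts.get('img', 0)
script_count = counter.counts.get('script', 0)
print('img tags:', img_count, '| script tags:', script_count)
```

Under these assumptions no img element exists until a browser runs the script, which would explain why find_all('img') cannot return the three images.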
However, you can search the response for the JavaScript variable that holds the image URLs:
var data = {"images":["https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg"],"interval":600};
Use re.search() with a regular expression to extract it, then parse the captured string with json.loads() into a Python dict, so that you can access its values easily.
Example
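import json
import re
import requests
url = 'https://www.acefitness.org/resources/everyone/exercise-library/14/bird-dog/'
html = requests.get(url).text  # raw HTML only; no JavaScript is executed
# Capture the object assigned to `var data` and parse it as JSON
data = json.loads(re.search(r'var data = (.*?);', html).group(1))
print(data['images'])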
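As a possible variation, the non-greedy pattern stops at the first semicolon, which would cut the JSON short if any string value ever contained one. The sketch below locates the assignment with a regex and lets json.JSONDecoder().raw_decode() find the end of the object instead. SAMPLE_HTML is a hypothetical stand-in for requests.get(url).text so the snippet runs offline, and extract_images is an illustrative helper name, not something the page or any library provides.

```python
import json
import re

# Hypothetical stand-in for requests.get(url).text, reduced to the relevant
# script. The `var data` line is the one quoted above.
SAMPLE_HTML = """
<div id="imageFlicker"></div>
<script>
var data = {"images":["https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-1.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-2.jpg","https://ik.imagekit.io/02fmeo4exvw/exercise-library/large/14-3.jpg"],"interval":600};
</script>
"""


def extract_images(html):
    """Return the list under "images" in `var data`, or [] if it is absent."""
    match = re.search(r'var data\s*=\s*', html)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing semicolon).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj.get('images', [])


images = extract_images(SAMPLE_HTML)
for src in images:
    print(src)
```

If the site changes the variable name or stops embedding the data as valid JSON, this approach would need adjusting; raw_decode raises json.JSONDecodeError in the latter case.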
Answered By - HedgeHog