Issue
My scraper is not consistently picking up the image URL on the page. Sometimes it does, most of the time it doesn't. When it doesn't pick up the URL, this is what I am getting in my CSV: data:
I cannot see what is wrong, can anyone help?
I've tried adding sleep times to ensure all elements on the page have loaded, and I've tried other pages on the same website, but it's the same thing: sometimes it works, sometimes it doesn't.
Should I use a different method for picking up the element instead of session_image = session_soup.img['src']?
I have also used this method many times to scrape other websites and never had this problem. Is it something to do with this particular website?
My code:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementClickInterceptedException
import time
import re
import csv
# initialize the chrome browser
browser = webdriver.Chrome(executable_path=r'./chromedriver')
browser.implicitly_wait(20)
# URL
class_pass_url = 'https://www.classpass.com'
# Create file and writes the first row, added encoding type as write was giving errors
f = open('ClassPass.csv', 'w', encoding='utf-8')
headers = 'IMAGE URL\n'
f.write(headers)
# classpass results page
page = "https://classpass.com/studios/sum-yoga-london"
browser.get(page)
# Browser waits
#browser_wait(browser, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "line")))
time.sleep(4)
# Scrolls to bottom of page to reveal all classes
# browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
studio_page_source = browser.page_source
studio_soup = soup(studio_page_source, "html.parser")
try:
    studio_name = studio_soup.h2.text
except (AttributeError, TypeError,) as e:
    pass
sessions = studio_soup.find_all('h3', {'class': '_4Fnd4DwToJFbbU5jAfqSv'})
for session in sessions:
    twitter, facebook, instagram, session_website, telnumber, session_description = '', '', '', '', '', ''
    session_link = class_pass_url + session.a['href']
    browser.get(session_link)
    #browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    #browser.execute_script("window.scrollTo(0,0);")
    #time.sleep(2)
    browser_wait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, '_1ruz3nW6mOnylv99BOA_tm')))
    # parses individual class page
    session_page_source = browser.page_source
    session_soup = soup(session_page_source, "html.parser")
    try:
        session_image = session_soup.img['src']
    except (AttributeError, TypeError,) as e:
        pass
    print(session_image)
    f.write(
        session_image +
        "\n")
Solution
To get the image, replace session_image = session_soup.img['src'] in your code with a lookup of the page's og:image meta tag:
session_image = session_soup.find('meta', {'property': "og:image"})
session_image = session_image.get('content')
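The "data:" value in the CSV suggests that the first img tag on the page sometimes holds an inline data: URI placeholder rather than a real link. That is an assumption; the question does not show the page markup. Under that assumption, a small helper can pick the first usable URL from several candidates and skip missing values and data: placeholders. The name first_real_url is hypothetical and not part of BeautifulSoup or Selenium; this is a minimal sketch, not a definitive fix.

```python
from typing import Iterable, Optional


def first_real_url(candidates: Iterable[Optional[str]]) -> str:
    """Return the first candidate that looks like a real URL, or '' if none does.

    Hypothetical helper: assumes unusable values are either missing (None or
    empty) or inline "data:" URIs, as the CSV output in the question suggests.
    """
    for value in candidates:
        if not value:
            # Attribute was missing or empty.
            continue
        cleaned = value.strip()
        if cleaned.lower().startswith("data:"):
            # Inline placeholder, not a link to the image.
            continue
        return cleaned
    return ""


# Standalone demonstration with literal strings instead of a parsed page.
example = first_real_url([
    None,
    "data:image/gif;base64,AAAA",
    " https://example.com/a.jpg ",
])
print(example)
```

With the soup from the question, the candidates could be the og:image content followed by every img src, which also avoids an error when the meta tag is absent:

    meta = session_soup.find('meta', {'property': 'og:image'})
    candidates = [meta.get('content') if meta else None]
    candidates += [img.get('src') for img in session_soup.find_all('img')]
    session_image = first_real_url(candidates)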
Answered By - Roman