Tuesday, January 18, 2022

[FIXED] Scraper not picking up image URL Beautiful Soup

beautifulsoup, python, selenium, selenium-webdriver, web-scraping

Issue

My scraper is not consistently picking up the image URL on the page. Sometimes it does, most of the time it doesn't. When it doesn't pick up the URL, this is what I am getting in my CSV: data:

I cannot see what is wrong, can anyone help?

I've tried adding sleep times to ensure all elements on the page have loaded, and I've tried other pages on the same website, but the result is the same: sometimes it works, sometimes it doesn't.

Should I use a different method for picking up the element than session_image = session_soup.img['src']?

I have also used this method many times to scrape other websites and never had this problem. Is it something to do with this particular website?

My code:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementClickInterceptedException
import time
import re
import csv

# initialize the chrome browser
browser = webdriver.Chrome(executable_path=r'./chromedriver')
browser.implicitly_wait(20)

# URL
class_pass_url = 'https://www.classpass.com'

# Create the file and write the header row; encoding is set explicitly because writing gave errors without it
f = open('ClassPass.csv', 'w', encoding='utf-8')
headers = 'IMAGE URL\n'
f.write(headers)

# classpass results page
page = "https://classpass.com/studios/sum-yoga-london"

browser.get(page)

# Browser waits

#browser_wait(browser, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "line")))
time.sleep(4)

# Scrolls to bottom of page to reveal all classes
# browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

studio_page_source = browser.page_source
studio_soup = soup(studio_page_source, "html.parser")

try:
    studio_name = studio_soup.h2.text
except (AttributeError, TypeError,) as e:
    pass

sessions = studio_soup.find_all('h3', {'class': '_4Fnd4DwToJFbbU5jAfqSv'})

for session in sessions:
    twitter, facebook, instagram, session_website, telnumber, session_description = '', '', '', '', '', ''
    session_link = class_pass_url + session.a['href']
    browser.get(session_link)

    #browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    #browser.execute_script("window.scrollTo(0,0);")
    #time.sleep(2)

    browser_wait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, '_1ruz3nW6mOnylv99BOA_tm')))

    # parses individual class page
    session_page_source = browser.page_source
    session_soup = soup(session_page_source, "html.parser")


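    # Reset on every iteration so a failed lookup cannot leave the name
    # undefined or carry over the previous session's value
    session_image = ''
    try:
        session_image = session_soup.img['src']
    except (AttributeError, TypeError, KeyError):
        pass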

    print(session_image)

    f.write(session_image + "\n")

# Close the CSV file so that all rows are flushed to disk
f.close()

Solution

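session_soup.img returns only the first <img> tag in the document. Judging by the data: value in your CSV, that tag is often a placeholder whose src is an inline data: URI rather than the real image. To get the image reliably, read the URL from the page's og:image meta tag instead, by making the following changes in your code: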

session_image = session_soup.find('meta', {'property': "og:image"})
session_image = session_image.get('content')
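
As a minimal sketch of how this lookup could be wrapped so that a missing tag does not raise, the helper below tries the og:image meta tag first. The function name extract_image_url, the sample markup and the fallback to the first non-data: <img> are assumptions added for illustration; they are not part of the original answer or of the ClassPass page.

```python
from bs4 import BeautifulSoup


def extract_image_url(html):
    """Return the page's image URL, or '' if none can be found."""
    page_soup = BeautifulSoup(html, "html.parser")

    # Preferred source: the Open Graph meta tag, as in the answer above.
    og_tag = page_soup.find("meta", {"property": "og:image"})
    if og_tag is not None and og_tag.get("content"):
        return og_tag["content"]

    # Fallback (an assumption, not part of the answer): the first <img>
    # whose src is a real URL rather than an inline data: placeholder.
    for img in page_soup.find_all("img", src=True):
        if not img["src"].startswith("data:"):
            return img["src"]
    return ""


# Hypothetical markup, loosely modelled on a lazily loaded page.
sample_html = """
<html><head>
<meta property="og:image" content="https://example.com/studio.jpg">
</head><body>
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=">
</body></html>
"""

print(extract_image_url(sample_html))
```

In the scraper this could be called as session_image = extract_image_url(browser.page_source), which always yields a string that is safe to write to the CSV.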

Answered By - Roman

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
