Friday, December 31, 2021

[FIXED] web scraping gives only first 4 elements on a page

beautifulsoup, python, selenium, web-scraping

Issue

I tried to scrape the search results on this page: https://shop.bodybuilding.com/search?q=protein+bar&selected_tab=Products with Selenium, but I only get the first 4 elements as a result. I am not sure why. Is it a JavaScript-rendered page? How can I scrape all the elements on this search page? Here is the code I wrote:

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='C:/chromedriver')
url = 'https://shop.bodybuilding.com/search?q=protein+bar&selected_tab=Products'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
all_items = soup.find_all('div', {'class': 'ProductTile ProductTile--flat Animate AnimateOnHover Animate--fade-in Animate--animated'})


for i in range(len(all_items)):
    prices=all_items[i].find('div', {'class': 'Price ProductTile__price'}).text
    names=all_items[i].find('p', {'class': 'ProductTile__title'}).text
    images=all_items[i].find('img')['src']
    url=all_items[i].find('a', {'class': 'Anchor ProductTile__image'})['href']

    print(images)

This is the result for the names on this page. As you can see, it only scrapes the first 4 elements:

BSN Protein Crisp Bars
Optimum Nutrition Protein Wafers
Herbaland Vegan Protein Gummies
Battle Bars Full Battle Rattle (FBR) Protein Bar

The same happens for prices, images, and URLs.

Solution

How to fix

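The remaining products are only loaded as you scroll, so scroll to the bottom repeatedly until the page height stops growing, and parse the page source afterwards: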

import time

# Height of the page before any scrolling
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll to the bottom so the next batch of products is loaded
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Give the page time to load the new items
    time.sleep(1)

    # Stop once scrolling no longer increases the page height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Parse only after everything has been loaded
soup = BeautifulSoup(driver.page_source, 'html.parser')
all_items = soup.find_all('div', {'class': 'ProductTile ProductTile--flat Animate AnimateOnHover Animate--fade-in Animate--animated'})


for i in all_items:
    # Some tiles have no price element, so fall back to None
    prices=i.find('div', {'class': 'Price ProductTile__price'}).text if i.find('div', {'class': 'Price ProductTile__price'}) else None
    names=i.find('p', {'class': 'ProductTile__title'}).text
    images=i.find('img')['src']
    url=i.find('a', {'class': 'Anchor ProductTile__image'})['href']

    print(images)
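
If the scroll loop is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: the function name scroll_to_bottom and its pause and max_rounds parameters are made up for illustration and are not part of Selenium. The max_rounds cap is an added safeguard, assuming you would rather stop early than loop forever on a page that keeps growing.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing.

    Hypothetical helper: `driver` is any object with an
    execute_script() method, such as a Selenium WebDriver.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom to trigger loading of more items
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # Wait for the page to load the new items
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so the bottom has been reached
            break
        last_height = new_height
    return rounds
```

It would be called as scroll_to_bottom(driver) right after driver.get(url) and before driver.page_source is handed to BeautifulSoup. A fixed pause is the simplest option; a slow connection may need a larger value.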

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
