Issue
I'm systematically scraping data from an online retailer's website. I have been running my Python script once a week for two months and it has worked great, but when I ran it today it returned blank files instead of the usual data. I have tried several things to fix it: switching to geckodriver gave the same result, and updating Selenium, ChromeDriver and Chrome didn't help either. Does anyone have suggestions? (This is my first post, so I hope I displayed the code clearly.)
from bs4 import BeautifulSoup
from selenium import webdriver
import numpy
import pandas as pd

url = "https://www.zalando.be/sportsokken/_zwart/"
driver = webdriver.Chrome(executable_path="/Users/lisabyloos/Downloads/chromedriver")

# Scrape result pages 1 and 2
pages = numpy.arange(1, 3, 1)

for page in pages:
    driver.get(url + "?p=" + str(page))
    html_content = driver.execute_script('return document.body.innerHTML')
    soup = BeautifulSoup(html_content, "lxml")

    # Select product cards by their (auto-generated) class string
    product_divs = soup.find_all("div", attrs={"class": "_4qWUe8 w8MdNG cYylcv QylWsg SQGpu8 iOzucJ JT3_zV DvypSJ"})

    results = []
    for product in product_divs:
        results.append(product.get_text(separator=";"))

    # Write one CSV per page: myfile1.csv, myfile2.csv, ...
    df = pd.DataFrame([sub.split(";") for sub in results])
    df.to_csv("myfile" + str(page) + ".csv")
Solution
What happens?
The classes of the elements you are trying to find are dynamically generated and have changed, so your selector no longer matches anything and find_all() returns an empty list.
Note: Pages change from time to time, but changes to the structure are rarer than changes to the styling. It is therefore always a good strategy to select by element type or id rather than by class.
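As an illustration (the snippet below uses hypothetical markup, not the live Zalando page), a selector built on a tag name survives a restyle while one built on generated class names does not:

from bs4 import BeautifulSoup

# Hypothetical product card: the class string is machine-generated and
# changes between deployments; the tag name is part of the page structure.
html = '<article class="_4qWUe8 w8MdNG">Sock A</article>'
soup = BeautifulSoup(html, "lxml")

soup.find_all("article", class_="_4qWUe8 w8MdNG")  # breaks when the class hash changes
soup.find_all("article")                           # keeps working after a restyle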
How to fix?
Adjust the selection criteria to get your results:
product_divs = soup.find_all('article')
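Dropped into your scraping loop, only the selector line changes and everything downstream stays the same. A minimal sketch, assuming the page still renders one article element per product card:

soup = BeautifulSoup(html_content, "lxml")
product_divs = soup.find_all("article")  # one <article> per product card

results = [product.get_text(separator=";") for product in product_divs]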
Answered By - HedgeHog