Issue
I'm trying to use BeautifulSoup to parse an iframe containing a Korean news article and print each individual body paragraph in the article. Because the Korean paragraph content lies in a p tag within its own td tag with a class of "tlTD", I figured I could just loop through each td with that class name and print its p tag, like so:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver

link = "https://gloss.dliflc.edu/GlossHtml/GlossHTML.html?disableBrowserLockout=true&gloss=true&glossLoXmlFileName=/GlossHtml/templates/linksLO/glossLOs/kp_cul312.xml&glossMediaPathRoot=https://gloss.dliflc.edu/products/gloss/"
base_url = "https://oda.dliflc.edu"
driver = webdriver.Chrome()
driver.get(link)
python_button = driver.find_element_by_id("gloss_link_source")
python_button.click()
source_src = driver.find_element_by_id("glossIframe").get_attribute("src")
source_url = urljoin(base_url, source_src)
driver.get(source_url)
soup = BeautifulSoup(driver.page_source, "lxml")
for td in soup.find_all("td", class_="tlTD"):
    print(soup.find("p").getText())
The problem is that, instead of printing the body paragraphs, the code repeatedly prints only the article title, which lies in its own td with a class of "title tlTD". I tried using a lambda expression and a regex to make the class name more exclusive, but I kept getting the same result. Changing soup.find("p") to a find_all successfully made the code print what I wanted, but it also printed a bunch of English-version content that I don't want.
I can understand why the article title content would be printed, since its class name includes "tlTD", but I'm baffled as to where the English content is coming from. When I inspected the page in Google Chrome, it didn't include any English body paragraphs, so why is BeautifulSoup scraping that? Can anyone help explain what's going on here and how I can get this code to print just the Korean body paragraph content?
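As an aside, part of the title-repetition problem is visible without Selenium at all: inside the loop, soup.find("p") searches the whole document from the top on every iteration, while td.find("p") searches only within the current td. Here is a minimal sketch using hypothetical stand-in HTML (not the real page markup) that reproduces the symptom:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal HTML mimicking the layout described above.
html = """
<table>
  <tr><td class="title tlTD"><p>Article Title</p></td></tr>
  <tr><td class="tlTD"><p>First body paragraph.</p></td></tr>
  <tr><td class="tlTD"><p>Second body paragraph.</p></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.find("p") always starts from the top of the document,
# so every iteration returns the same first <p>: the title.
for td in soup.find_all("td", class_="tlTD"):
    print(soup.find("p").getText())  # "Article Title" every time

# td.find("p") is scoped to the current <td>.
for td in soup.find_all("td", class_="tlTD"):
    print(td.find("p").getText())  # title once, then each body paragraph
```

Note that class_="tlTD" also matches the "title tlTD" td, because BeautifulSoup matches when any one of an element's classes equals the filter.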
Solution
The td tags with the tlTD class are inside an iframe. You can access the iframe's data easily: first, use an XPath to locate the iframe:
iframe = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//iframe[@id='glossIframe']")))
Then switch_to the iframe:
driver.switch_to.frame(iframe)
Here's how to switch back to the default content (out of the iframe):
driver.switch_to.default_content()
Example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
link = "https://gloss.dliflc.edu/GlossHtml/GlossHTML.html?disableBrowserLockout=true&gloss=true&glossLoXmlFileName=/GlossHtml/templates/linksLO/glossLOs/kp_cul312.xml&glossMediaPathRoot=https://gloss.dliflc.edu/products/gloss/"
driver.get(link)
source_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "gloss_link_source")))
source_button.click()
# switch to the iframe
iframe = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//iframe[@id='glossIframe']")))
driver.switch_to.frame(iframe)
soup = BeautifulSoup(driver.page_source, "lxml")
# scrape the iframe's data
for td in soup.find_all("td", class_="tlTD"):
print(td.find("p").getText())
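Since class_="tlTD" also matches the title cell (its class is "title tlTD"), you can skip it by checking the element's class list. A small sketch of that filter, again with hypothetical stand-in HTML rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; in the real script, soup comes from driver.page_source.
html = """
<table>
  <tr><td class="title tlTD"><p>Article Title</p></td></tr>
  <tr><td class="tlTD"><p>Body paragraph.</p></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

for td in soup.find_all("td", class_="tlTD"):
    # td["class"] is a list of classes; skip any cell also tagged "title"
    if "title" in td.get("class", []):
        continue
    print(td.find("p").getText())  # body paragraphs only
```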
Answered By - bharatk