Issue
I'm trying to use BeautifulSoup to parse an iframe containing a Korean news article and print each individual body paragraph in the article. Because the Korean paragraph content lies in a p tag within its own td tag with a class of "tlTD", I figured I could just loop through each td with that class name and print its p tag, like so:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver

link = "https://gloss.dliflc.edu/GlossHtml/GlossHTML.html?disableBrowserLockout=true&gloss=true&glossLoXmlFileName=/GlossHtml/templates/linksLO/glossLOs/kp_cul312.xml&glossMediaPathRoot=https://gloss.dliflc.edu/products/gloss/"
base_url = "https://oda.dliflc.edu"
driver = webdriver.Chrome()
driver.get(link)
python_button = driver.find_element_by_id("gloss_link_source")
python_button.click()
source_src = driver.find_element_by_id("glossIframe").get_attribute("src")
source_url = urljoin(base_url, source_src)
driver.get(source_url)
soup = BeautifulSoup(driver.page_source, "lxml")
for td in soup.find_all("td", class_="tlTD"):
    print(soup.find("p").getText())
The problem is that, instead of printing the body paragraphs, the code repeatedly prints only the article title, which lies in its own td with a class of "title tlTD". I tried using a lambda expression and a regex to make the class name more exclusive, but I kept getting the same result. Changing soup.find("p") to a find_all successfully made the code print what I wanted, but it also printed a bunch of English-version content that I don't want.
I can understand why the article title content would be printed, since its class name includes "tlTD", but I'm baffled as to where the English content is coming from. When I inspected the page in Google Chrome, it didn't include any English body paragraphs, so why is BeautifulSoup scraping that? Can anyone help explain what's going on here and how I can get this code to print just the Korean body paragraph content?
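As an aside, part of the title-repetition problem is visible without Selenium at all: inside the loop, soup.find("p") searches the whole document from the top on every iteration, while td.find("p") searches only within the current td. Here is a minimal sketch using hypothetical stand-in HTML (not the real page markup) that reproduces the symptom:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal HTML mimicking the layout described above.
html = """
<table>
  <tr><td class="title tlTD"><p>Article Title</p></td></tr>
  <tr><td class="tlTD"><p>First body paragraph.</p></td></tr>
  <tr><td class="tlTD"><p>Second body paragraph.</p></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.find("p") always starts from the top of the document,
# so every iteration returns the same first <p>: the title.
for td in soup.find_all("td", class_="tlTD"):
    print(soup.find("p").getText())  # "Article Title" every time

# td.find("p") is scoped to the current <td>.
for td in soup.find_all("td", class_="tlTD"):
    print(td.find("p").getText())  # title once, then each body paragraph
```

Note that class_="tlTD" also matches the "title tlTD" td, because BeautifulSoup matches when any one of an element's classes equals the filter.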
Solution
The td tags with the tlTD class are inside an iframe. You can access the iframe's data easily: first, use an XPath to locate the iframe:
iframe = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//iframe[@id='glossIframe']")))
Then switch_to the iframe:
driver.switch_to.frame(iframe)
Here's how to switch back to the default content (out of the iframe):
driver.switch_to.default_content()
Example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
link = "https://gloss.dliflc.edu/GlossHtml/GlossHTML.html?disableBrowserLockout=true&gloss=true&glossLoXmlFileName=/GlossHtml/templates/linksLO/glossLOs/kp_cul312.xml&glossMediaPathRoot=https://gloss.dliflc.edu/products/gloss/"
driver.get(link)
source_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "gloss_link_source")))
source_button.click()
# switch to the iframe
iframe = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//iframe[@id='glossIframe']")))
driver.switch_to.frame(iframe)
soup = BeautifulSoup(driver.page_source, "lxml")
# scrape the iframe's data
for td in soup.find_all("td", class_="tlTD"):
print(td.find("p").getText())
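Since class_="tlTD" also matches the title cell (its class is "title tlTD"), you can skip it by checking the element's class list. A small sketch of that filter, again with hypothetical stand-in HTML rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; in the real script, soup comes from driver.page_source.
html = """
<table>
  <tr><td class="title tlTD"><p>Article Title</p></td></tr>
  <tr><td class="tlTD"><p>Body paragraph.</p></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

for td in soup.find_all("td", class_="tlTD"):
    # td["class"] is a list of classes; skip any cell also tagged "title"
    if "title" in td.get("class", []):
        continue
    print(td.find("p").getText())  # body paragraphs only
```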
Answered By - bharatk