Issue
I want to scrape content from a webpage; its URL is given in the script below.
Code
I have this web-scraping script:
import requests
from bs4 import BeautifulSoup
# URL of the webpage to scrape
url = "https://anastrophe.uchicago.edu/cgi-bin/perseus/morph.pl?token=%CF%84%E1%BF%B7&lang=greek"
# Send a GET request to the webpage
response = requests.get(url)
# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")
# Find all tables with class "lemmacontainer"
tables = soup.find_all("table", class_="lemmacontainer")
# Iterate over each table
for table in tables:
    # Find the link within the table
    link = table.find("a")["href"]
    # Find all code texts within the table
    code_texts = table.find_all("td", class_="code")
    # Extract and print the link and code texts
    print("Link:", link)
    for code_text in code_texts:
        print("Code Text:", code_text.get_text(strip=True))
    print()
If you run the script in a terminal, it outputs:
Link: http://logeion.uchicago.edu/ὅς
Code Text: relative pronoun masc. dat. sg.(ionic)
Code Text: relative pronoun neut. dat. sg.(ionic)
Link: http://logeion.uchicago.edu/ὁ
Code Text: definite article neut. dat. sg.
Code Text: definite article masc. dat. sg.
Code Text: adverb
Code Text: interrogative pronoun common dat. sg.
Code Text: interrogative pronoun neut. dat. sg.
Link: http://logeion.uchicago.edu/τῷ
Code Text: adverb
Code Text: interrogative pronoun common dat. sg.
Code Text: interrogative pronoun neut. dat. sg.
Link: http://logeion.uchicago.edu/τίς
Code Text: interrogative pronoun common dat. sg.
Code Text: interrogative pronoun neut. dat. sg.
Issue
But, as one can see by visiting the webpage, Table 3 is supposed to only have one "code" text ('adverb'), but it has three in the output, and Table 2 is supposed to only have two "code" texts ('definite article neut. dat. sg.' and 'definite article masc. dat. sg.') but there are five.
Restrictions and What I have tried
I want to use the script on multiple pages, so I can't just truncate the output.
I tried iterating outside the tables, among other things, but I can't get the script to do what it's supposed to.
Question
What do I have to do to make the script scrape and output the correct "code" texts per table?
Solution
This happens because the site serves malformed HTML. For example, this is the raw HTML served by the site:
...
<div class=idcontainer tokenid= />
...
<table class=lemmacontainer>
...
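With non-standard HTML like this, different parsers may build different trees. That is why bs4 gives a different result from the browser and why your output looks strange. Your output suggests that html.parser nests the later tables inside the earlier ones (Table 2 also lists the cells of Tables 3 and 4, and Table 3 also lists those of Table 4), so find_all on an outer table returns the "code" cells of the tables nested within it as well.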
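One possible workaround, given as a minimal sketch rather than a verified fix for this site, is to keep only the "code" cells whose nearest enclosing lemmacontainer table is the table currently being processed. The function name codes_per_table, the sample markup, and the example.org links are made up for illustration. The sample deliberately nests one table inside another to imitate the tree html.parser seems to build from the real page.

```python
from bs4 import BeautifulSoup


def codes_per_table(html, parser="html.parser"):
    """Return (link, code texts) for each lemmacontainer table.

    A cell counts for a table only if that table is its nearest
    enclosing lemmacontainer, so cells from nested tables are skipped.
    """
    soup = BeautifulSoup(html, parser)
    results = []
    for table in soup.find_all("table", class_="lemmacontainer"):
        # The first anchor in document order; assumed to be the table's own link
        anchor = table.find("a", href=True)
        link = anchor["href"] if anchor else None
        codes = [
            td.get_text(strip=True)
            for td in table.find_all("td", class_="code")
            # Identity check: keep the cell only if this table is its closest owner
            if td.find_parent("table", class_="lemmacontainer") is table
        ]
        results.append((link, codes))
    return results


# Hypothetical markup: the second table is nested inside the first
sample = """
<table class="lemmacontainer">
<tr><td><a href="http://example.org/a">a</a></td><td class="code">first</td></tr>
<tr><td>
<table class="lemmacontainer">
<tr><td><a href="http://example.org/b">b</a></td><td class="code">second</td></tr>
</table>
</td></tr>
</table>
"""

for link, codes in codes_per_table(sample):
    print("Link:", link)
    for code in codes:
        print("Code Text:", code)
    print()
```

To use it on the real page, pass response.content instead of sample. Passing another parser name such as "lxml" or "html5lib" (both must be installed separately) may also give a tree closer to the one the browser builds, but which of them works for this page would need to be checked.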
Answered By - namgold