Monday, December 4, 2023

[FIXED] Why does my Python script scrape incorrectly?

Labels: beautifulsoup, python, web-scraping

Issue

I want to scrape content from the following webpage (the URL is the one assigned to url in the script below):

Code

I have this web-scraping script:

import requests
from bs4 import BeautifulSoup

# URL of the webpage to scrape
url = "https://anastrophe.uchicago.edu/cgi-bin/perseus/morph.pl?token=%CF%84%E1%BF%B7&lang=greek"

# Send a GET request to the webpage
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find all tables with class "lemmacontainer"
tables = soup.find_all("table", class_="lemmacontainer")

# Iterate over each table
for table in tables:
    # Find the link within the table
    link = table.find("a")["href"]
    
    # Find all code texts within the table
    code_texts = table.find_all("td", class_="code")
    
    # Extract and print the link and code texts
    print("Link:", link)
    for code_text in code_texts:
        print("Code Text:", code_text.get_text(strip=True))
    print()

If you run the script in a terminal, it outputs:

Link: http://logeion.uchicago.edu/ὅς
Code Text: relative pronoun masc. dat. sg.(ionic)
Code Text: relative pronoun neut. dat. sg.(ionic)

Link: http://logeion.uchicago.edu/ὁ
Code Text: definite article neut. dat. sg.
Code Text: definite article masc. dat. sg.
Code Text: adverb
Code Text: interrogative pronoun common dat. sg.
Code Text: interrogative pronoun neut. dat. sg.

Link: http://logeion.uchicago.edu/τῷ
Code Text: adverb
Code Text: interrogative pronoun common dat. sg.
Code Text: interrogative pronoun neut. dat. sg.

Link: http://logeion.uchicago.edu/τίς
Code Text: interrogative pronoun common dat. sg.
Code Text: interrogative pronoun neut. dat. sg.

Issue

But, as you can see by visiting the webpage, Table 3 should have only one "code" text ('adverb'), yet the output shows three. Table 2 should have only two "code" texts ('definite article neut. dat. sg.' and 'definite article masc. dat. sg.'), yet the output shows five.

Restrictions and What I have tried

I want to use the script on multiple pages, so I can't just truncate the output.

I tried iterating outside the tables, among other things, but I can't get the script to do what it's supposed to.

Question

What do I have to do to make the script scrape and output the correct "code" texts per table?

Solution

The site serves malformed HTML. For example, this is part of the raw HTML it returns:

...
<div class=idcontainer tokenid= />
...
<table class=lemmacontainer>
...

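With non-standard HTML like this, different parsers can build different trees. That is why BeautifulSoup's result differs from what the browser shows, and why your output looks strange. The output is consistent with html.parser nesting Table 4 inside Table 3 and Table 3 inside Table 2, so find_all on an outer table also returns the "code" cells of the tables nested inside it.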
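
One possible workaround, sketched here under assumptions and not taken from the original answer, is to keep only the "code" cells whose nearest enclosing table is the table currently being processed. This does not depend on how a given parser repairs the markup. SAMPLE_HTML, the example.invalid links, and the helper names own_code_texts and scrape are hypothetical; the sample only imitates the nesting that the parser appears to produce and is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the parsed page: a "lemmacontainer" table that
# ends up nested inside another one (an assumption, not the real markup).
SAMPLE_HTML = """
<table class="lemmacontainer">
  <tr><td><a href="http://example.invalid/a">a</a></td><td class="code">outer 1</td></tr>
  <tr><td>
    <table class="lemmacontainer">
      <tr><td><a href="http://example.invalid/b">b</a></td><td class="code">inner 1</td></tr>
    </table>
  </td></tr>
</table>
"""


def own_code_texts(table):
    """Return the text of "code" cells that belong directly to this table."""
    return [
        td.get_text(strip=True)
        for td in table.find_all("td", class_="code")
        # Skip cells whose closest <table> ancestor is a nested table.
        if td.find_parent("table") is table
    ]


def scrape(html, parser="html.parser"):
    """Return a list of (link, code_texts) pairs, one per lemmacontainer table."""
    soup = BeautifulSoup(html, parser)
    results = []
    for table in soup.find_all("table", class_="lemmacontainer"):
        anchor = table.find("a")
        link = anchor["href"] if anchor is not None else None
        results.append((link, own_code_texts(table)))
    return results


if __name__ == "__main__":
    for link, code_texts in scrape(SAMPLE_HTML):
        print("Link:", link)
        for text in code_texts:
            print("Code Text:", text)
        print()
```

To use this on the real page, pass response.content to scrape instead of SAMPLE_HTML. Trying a browser-like parser such as html5lib (if it is installed) through the parser argument may also yield a tree closer to what the browser displays, though that is untested here.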

Answered By - namgold

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
