Monday, June 20, 2022

[FIXED] Loop iterating over a list of BeautifulSoup Parse Tree Elements terminating too early

beautifulsoup, html, loops, parsing, python

Issue

I am taking an Intro to Data Science course, and my first task is to extract certain data fields from each country page of the CIA World Factbook. Although I have recently become aware that there are easier ways to locate the data, I would like to follow through on my initial thought process, which is as follows.

I developed a function which iterates over the result of:

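from urllib.request import urlopen
from bs4 import BeautifulSoup

# fulllink is the list of country page URLs built earlier
for link in fulllink:
    with urlopen(link) as countrypage:
        countrysoup = BeautifulSoup(countrypage, "lxml")
        data = countrysoup.find_all(class_="category_data")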

I have confirmed that if the string values I require exist on a country page, they will be present in data. The following function iterates over those tags and uses .parent and .previous_sibling to determine whether the string value attached to each tag is one I am interested in extracting.

def get_wfb_data(x):
    country_data = [0, 0, 0, 0]
    for tag in x:
        try:
            if 'Area:' in tag.parent.previous_sibling.previous_sibling.strings and 'total: ' in tag.parent.strings:
                country_data[0] = tag.string
            elif 'GDP (purchasing power parity):' in tag.previous_sibling.previous_sibling.strings:
                country_data[1] = tag.string
            elif 'Roadways:' in tag.parent.previous_sibling.previous_sibling.strings and 'total: ' in tag.parent.strings:
                country_data[2] = tag.string
            elif 'Railways:' in tag.parent.previous_sibling.previous_sibling.strings and 'total: ' in tag.parent.strings:
                country_data[3] = tag.string
            else:
                continue
        except:
            pass

    return country_data
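
The sibling navigation this function relies on can be checked in isolation. The sketch below uses a small, hypothetical HTML fragment (not the actual World Factbook markup) and the built-in html.parser to show why .previous_sibling is called twice: the whitespace between two tags is itself a node in the parse tree.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup for illustration only; the real page layout may differ.
html = (
    '<div id="field-area">Area:</div>\n'
    '<div><span>total: </span>'
    '<span class="category_data">100 sq km</span></div>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.find(class_="category_data")

first = tag.parent.previous_sibling   # the newline between the two <div> tags
second = first.previous_sibling       # the <div> holding the field name

print(isinstance(first, NavigableString))
print(second.name)
print("Area:" in second.strings and "total: " in tag.parent.strings)
```

If a field on the real page has a different number of nodes between its label and its value, the same two-step hop lands on a different node and the condition is simply False. Printing these intermediate nodes for one known tag is a quick way to check each branch.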

Exception handling is used to deal with NavigableString objects, which do not have such attributes and thus raise an exception. Replacing values in a list of zeroes lets me handle situations where a particular region has no data listed in one of the four categories I am interested in extracting. Defined as four separate functions, the respective criteria work, but together they are extremely slow because the list must be iterated over up to four times. However, the end result from this combined function is always a list where the first zero has been replaced but the others have not, for example ["Total Area #", 0, 0, 0].

I believe the loop terminates after matching the first if statement to a tag in x. How can I repair my function so that it continues down x?

Solution

I'm making the assumption that you're making this call: get_wfb_data(data)

Your function never fails; it simply never matches the rest of your conditions.

I tried it on a couple of different links, each of which contained data for GDP, Roadways, etc.

By printing the length of x and the number of iterations through the for loop, you can confirm that the loop does not terminate after matching the first condition: it runs all the way to the last element of x. It is the remaining conditions that are never satisfied.

def get_wfb_data(x):
    country_data = [0, 0, 0, 0]
    print(len(x))
    i = 0
    for tag in x:
        i += 1
        try:
            if 'Area:' in tag.parent.previous_sibling.previous_sibling.strings \
                    and 'total: ' in tag.parent.strings:
                country_data[0] = tag.string
            elif 'GDP (purchasing power parity):' in tag.previous_sibling.previous_sibling.strings:
                country_data[1] = tag.string
            elif 'Roadways:' in tag.parent.previous_sibling.previous_sibling.strings \
                    and 'total: ' in tag.parent.strings:
                country_data[2] = tag.string
            elif 'Railways:' in tag.parent.previous_sibling.previous_sibling.strings \
                    and 'total: ' in tag.parent.strings:
                country_data[3] = tag.string
            else:
                continue
        except:
            pass
    print(str(i))
    return country_data
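
One way to see which conditions are actually being met, sketched here with hypothetical stand-in data rather than real parse-tree elements, is to count the outcome for every element instead of discarding exceptions with a bare except. The names tally_outcomes and classify are illustrative and are not part of the original code.

```python
from collections import Counter


def tally_outcomes(items, classify):
    """Count, per item, the matched label, 'no match', or the exception name."""
    outcomes = Counter()
    for item in items:
        try:
            label = classify(item)
        except (AttributeError, TypeError) as exc:
            # Record the failure instead of silently passing over it.
            outcomes[type(exc).__name__] += 1
        else:
            outcomes[label or "no match"] += 1
    return outcomes


def classify(text):
    # Hypothetical stand-in for the if/elif chain in get_wfb_data.
    if text.startswith("Area:"):
        return "area"
    if text.startswith("Roadways:"):
        return "roadways"
    return None


# None stands in for an element that lacks the expected attribute.
sample = ["Area: 100 sq km", "Population: 5", None, "Roadways: 12 km"]
outcomes = tally_outcomes(sample, classify)

# Every element is accounted for, so the loop did not stop early.
print(sum(outcomes.values()) == len(sample))
print(dict(outcomes))
```

If the total equals the length of the input, the loop visited every element, and the per-label counts show which branches never fired and how many elements were skipped because of an exception.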

Answered By - user2096803

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
