Issue
I am taking an Intro to Data Science course, and my first task is to extract certain data fields from each country page of the CIA World Factbook. Although I have recently become aware that there are easier ways to locate the data, I would like to follow through on my initial approach, which is as follows.
I developed a function which iterates over the result of:
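from urllib.request import urlopen
from bs4 import BeautifulSoup

# fulllink is the list of country page URLs built earlier
for link in fulllink:
    with urlopen(link) as countrypage:
        countrysoup = BeautifulSoup(countrypage, "lxml")
        data = countrysoup.find_all(class_="category_data")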
I have confirmed that if the string values I require exist on a country page, they will be present in data. The following function takes the list of tags and uses .parent and .previous_sibling to determine whether the string value attached to each tag is one I am interested in extracting.
def get_wfb_data(x):
    country_data = [0,0,0,0]
    for tag in x:
        try:
            if 'Area:' in tag.parent.previous_sibling.previous_sibling.strings and 'total: ' in tag.parent.strings:
                country_data[0]=tag.string
            elif 'GDP (purchasing power parity):' in tag.previous_sibling.previous_sibling.strings:
                country_data[1]=tag.string
            elif 'Roadways:' in tag.parent.previous_sibling.previous_sibling.strings and 'total: ' in tag.parent.strings:
                country_data[2]=tag.string
            elif 'Railways:' in tag.parent.previous_sibling.previous_sibling.strings and 'total: ' in tag.parent.strings:
                country_data[3]=tag.string
            else:
                continue
        except:
            pass
    return country_data
Exception handling is used to deal with NavigableString objects, which do not have such attributes and thus raise an exception. Replacing values in a list of zeroes allows me to handle situations where a particular region has no data listed in one of the four categories I am interested in extracting. Furthermore, when defined as four separate functions, the respective criteria work, but together they are extremely slow, as the list must be iterated over up to four times. However, the end result from this function is always a list where the first zero has been replaced but the others have not, as follows: ["Total Area #", 0, 0, 0].
I believe the loop terminates after matching the first if statement to a tag in x. How can I repair my function to continue down x?
Solution
I'm assuming that you're making this call: get_wfb_data(data)
Your function never fails; it simply never matches the rest of your conditions.
I tried your function on a couple of different links (each contained data for GDP, Roadways, etc.).
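One general reason a condition like these can evaluate to False without raising anything, offered as a Python observation rather than something verified against the Factbook markup: `in` applied to a list or generator of strings (which is what `.strings` yields) compares each whole string for equality and does not search for a substring. The strings below are made up for illustration, and `contains_text` is a hypothetical helper, not part of BeautifulSoup.

```python
# Made-up strings standing in for what tag.parent.strings might yield.
strings = ['total:', ' 1,000 sq km']

# Membership in a list or generator is an equality test per element,
# so a missing trailing space is enough to make the check fail.
print('total: ' in strings)   # False
print('total:' in strings)    # True


def contains_text(needle, strings):
    """Hypothetical helper: substring match after trimming whitespace."""
    needle = needle.strip()
    return any(needle in s.strip() for s in strings)


print(contains_text('total: ', strings))            # True
print(contains_text('Area:', iter(['\nArea:\n'])))  # True
```

If the page text carries different whitespace than the literals in the conditions, a tolerant comparison along these lines may behave more predictably than exact membership.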
By checking the length of x and printing the number of iterations through the for loop, you can confirm that the loop does not terminate after matching the first condition; it reaches the last data element. It is in fact the remaining conditions that are never satisfied.
def get_wfb_data(x):
    country_data = [0, 0, 0, 0]
    print(len(x))
    i = 0
    for tag in x:
        i += 1
        try:
            if 'Area:' in tag.parent.previous_sibling.previous_sibling.strings \
                    and 'total: ' in tag.parent.strings:
                country_data[0] = tag.string
            elif 'GDP (purchasing power parity):' in tag.previous_sibling.previous_sibling.strings:
                country_data[1] = tag.string
            elif 'Roadways:' in tag.parent.previous_sibling.previous_sibling.strings \
                    and 'total: ' in tag.parent.strings:
                country_data[2] = tag.string
            elif 'Railways:' in tag.parent.previous_sibling.previous_sibling.strings \
                    and 'total: ' in tag.parent.strings:
                country_data[3] = tag.string
            else:
                continue
        except:
            pass
    print(str(i))
    return country_data
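One way to dig further, offered as a sketch rather than a confirmed diagnosis: the bare `except: pass` hides which lookups fail. In Python, if the first `if` condition raises, the whole `if`/`elif` chain is abandoned for that tag, so the later `elif` conditions are never even evaluated. The example below uses stand-in objects instead of real BeautifulSoup tags (`FakeTag`, `chained`, `safe_contains` and `independent` are hypothetical names made up for illustration) to show the effect, and how guarding each condition separately avoids it.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag; not a BeautifulSoup class."""

    def __init__(self, label, heading=None):
        self.label = label      # text that the GDP-style check looks at
        self.heading = heading  # None mimics a missing parent or sibling


def chained(tag):
    """Mirrors the original shape: one try around the whole if/elif chain."""
    try:
        if 'Area:' in tag.heading:   # raises TypeError when heading is None
            return 'area'
        elif 'GDP' in tag.label:     # never evaluated if the line above raises
            return 'gdp'
    except TypeError:
        pass
    return None


def safe_contains(needle, haystack):
    """Return False instead of raising when haystack is missing."""
    return haystack is not None and needle in haystack


def independent(tag):
    """Each condition is guarded on its own, so one failure cannot hide another."""
    if safe_contains('Area:', tag.heading):
        return 'area'
    if safe_contains('GDP', tag.label):
        return 'gdp'
    return None


gdp_tag = FakeTag(label='GDP (purchasing power parity):', heading=None)
print(chained(gdp_tag))      # None
print(independent(gdp_tag))  # gdp
```

In the real function the failing lookup would more likely be an AttributeError from a missing `.previous_sibling`, but the control flow is the same. Catching a specific exception and printing the tag when it occurs would show whether this is what happens on the Factbook pages.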
Answered By - user2096803