Issue
I'm working through a book called "Learn Python by Building Data Science Applications", and there's a chapter on web scraping, which I fully admit I have not played with before. I've reached a portion where it discusses unordered lists and how to work with them, and my code is generating an error that doesn't make sense to me:
Traceback (most recent call last):
  File "/Users/gillian/100-days-of-code/Learn-Python-by-Building-Data-Science-Applications/Chapter07/wiki2.py", line 77, in <module>
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
IndexError: list index out of range
My first thought was that there simply wasn't an unordered list on the page anymore, but I checked, and... there is. My interpretation of the error is that the call isn't returning the list, but I'm having trouble figuring out how to test that. I fully admit that recursion makes me dizzy; it's not my best area.
My full code is attached (including the notes I took, hence the giant blocks of comments).
'''scrapes list of WWII battles'''
import requests as rq
base_url = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
response = rq.get(base_url)
'''access the raw content of a page with response.content'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')
'''3 ways to search for an element:
1. find
2. find_all
3. select
for 1 and 2 you pass an object type and, optionally, attributes;
a recursive argument defines whether the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string
this makes select easier to use, sometimes
'''
content = soup.select('div#mw-content-text > div.mw-parser-output', limit=1)[0]
'''
collect corresponding elements for each front, which are all h2 headers
all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections
last title is citations and notes
one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable
'''
fronts = content.select('div.mw-parser-output>h2')[:-1]
for el in fronts:
    print(el.text[:-6])
'''getting the corresponding ul lists for each header
bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element
to get this all simultaneously, we'll need to use recursion
'''
def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(':', '').strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find('a')
        if link:
            link = _abs_link(link.get('href'))
        r = {'url': link,
             'time': time,
             'level': level}
        if ul:
            r['children'] = dictify(ul, level=(level + 1))
        result[key] = r
    return result
theaters = {}
for front in fronts:
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)
If anyone has any input about how I can proceed to troubleshoot this, I'd really appreciate it. Thanks.
Solution
The error means that .find_next_siblings didn't find anything, so indexing the empty result with [0] fails. Try changing it to front.find_next_siblings("div", "div-col"). Also, _abs_link() isn't specified anywhere in the post, so I removed it:
"""scrapes list of WWII battles"""
import requests as rq
base_url = "https://en.wikipedia.org/wiki/List_of_World_War_II_battles"
response = rq.get(base_url)
"""access the raw content of a page with response.content"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
"""3 ways to search for an element:
1. find
2. find_all
3. select
for 1 and 2 you pass an object type and, optionally, attributes;
a recursive argument defines whether the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string
this makes select easier to use, sometimes
"""
content = soup.select("div#mw-content-text > div.mw-parser-output", limit=1)[0]
"""
collect corresponding elements for each front, which are all h2 headers
all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections
last title is citations and notes
one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable
"""
fronts = content.select("div.mw-parser-output>h2")[:-1]
for el in fronts:
    print(el.text[:-6])
"""getting the corresponding ul lists for each header
bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element
to get this all simultaneously, we'll need to use recursion
"""
def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(":", "").strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find("a")
        if link:
            link = link.get("href")
        r = {"url": link, "time": time, "level": level}
        if ul:
            r["children"] = dictify(ul, level=(level + 1))
        result[key] = r
    return result
theaters = {}
for front in fronts:
    list_element = front.find_next_siblings("div", "div-col")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)
print(theaters)
Prints:
{
    "African Front": {
        "North African campaign": {
            "url": "/wiki/North_African_campaign",
            "time": "June 1940 - May 1943",
            "level": 0,
            "children": {
                "Western Desert campaign": {
                    "url": "/wiki/Western_Desert_campaign",
                    "time": "June 1940 – February 1943",
                    "level": 1,
                    "children": {
                        "Italian invasion of Egypt": {
                            "url": "/wiki/Italian_invasion_of_Egypt",
                            "time": "September 1940",
                            "level": 2,
                        },
                        "Operation Compass": {
                            "url": "/wiki/Operation_Compass",
                            "time": "December 1940 – February 1941",
                            "level": 2,
                            "children": {
                                "Battle of Nibeiwa": {
                                    "url": "/wiki/Battle_of_Nibeiwa",
                                    "time": "December 1940",
                                    "level": 3,
                                },
...and so on.
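Why shortening the class filter helps: when BeautifulSoup is given a string as the class filter, an element matches if the string equals any one of its individual CSS classes, or the exact full value of the class attribute. A multi-word string like "div-col columns column-width" therefore silently stops matching as soon as Wikipedia changes any of those class names, while "div-col" keeps matching on that one class alone. A minimal sketch of that behaviour (toy HTML, not the real Wikipedia page):

```python
from bs4 import BeautifulSoup

html = '<div class="div-col columns"></div><div class="div-col"></div>'
soup = BeautifulSoup(html, 'html.parser')

# A single class name matches every element carrying that class:
print(len(soup.find_all('div', 'div-col')))               # 2

# A multi-word string only matches the exact class attribute value:
print(len(soup.find_all('div', 'div-col columns')))       # 1

# ...so it finds nothing once the page's class list changes:
print(len(soup.find_all('div', 'columns column-width')))  # 0
```

This is why matching on a single stable class such as "div-col" is more robust against markup changes than matching the full class string.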
Answered By - Andrej Kesely