Issue
I want to scrape all the links from the Abstract and Early Life sections of the page https://en.wikipedia.org/wiki/Barack_Obama and store the links from the two sections separately. However, I am having trouble isolating the right tag/class. I tried the class "mw-headline" for the Early Life section, but that returns only the header text. Any hints are really appreciated.
I couldn't figure out how to get the Abstract and Early Life sections separately.
import requests
from bs4 import BeautifulSoup as bs

url = 'https://en.wikipedia.org/wiki/Barack_Obama'
response = requests.get(url)
soup = bs(response.content, 'html.parser')
page = soup.find('div', attrs={'id': 'bodyContent'})
early_life = page.findAll('span', attrs={'class': 'mw-headline'})
Solution
It's not very clear what format you want your output to be in, but the following will produce a list of dictionaries with the sections in several different formats:
First, the abstract and the sections are all inside the same div (the one matched by the selector below) and are not nested any further into separate elements, so this starts by selecting that outer element and then going through its children:
content = soup.select_one('#mw-content-text > .mw-parser-output').children
splitContent = []
(splitContent is the list that will be filled up with a dictionary for each section.)
for c in content:
    if c.name == 'h2' or splitContent == []:
        sectionName = 'Abstract' if splitContent == [] else c.text
        splitContent.append({
            'section': sectionName,
            'listSoups': [], 'HTML': '', 'asText': ''
        })

    splitContent[-1]['listSoups'].append(c)
    splitContent[-1]['HTML'] += str(c)
    if c.name not in ['style', 'script']:
        splitContent[-1]['asText'] += c.text
Each section header is wrapped as h2*, so every time the loop gets to a child tag that's h2, a new dictionary is started, and the child object itself is always added to listSoups in the last dictionary of the splitContent list.
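To see the splitting in isolation, here is a minimal sketch that runs the same idea on a small inline document instead of the live page. SAMPLE_HTML and the split_sections helper are made up for illustration, and the markup only assumes what the answer assumes: h2 tags sitting as direct children of the content div.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article markup: an intro, then h2-delimited sections.
SAMPLE_HTML = (
    '<div id="mw-content-text"><div class="mw-parser-output">'
    '<p>Intro with <a href="/wiki/A">A</a>.</p>'
    '<h2>Early life</h2><p>Born in <a href="/wiki/B">B</a>.</p>'
    '<h3>Education</h3><p>Studied at <a href="/wiki/C">C</a>.</p>'
    '<h2>Career</h2><p>Worked at <a href="/wiki/D">D</a>.</p>'
    '</div></div>'
)


def split_sections(html):
    """Group the direct children of the content div into h2-delimited sections."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.select_one('#mw-content-text > .mw-parser-output')
    sections = []
    for child in container.children:
        # Start a new section at every h2, and once at the very beginning.
        if child.name == 'h2' or not sections:
            name = 'Abstract' if not sections else child.get_text()
            sections.append({'section': name, 'listSoups': []})
        sections[-1]['listSoups'].append(child)
    return sections


sections = split_sections(SAMPLE_HTML)
print([s['section'] for s in sections])
# ['Abstract', 'Early life', 'Career']
```

The h3 subsection stays inside 'Early life' because only h2 starts a new dictionary.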
HTML is saved too, so if you want a single bs4 object to be created for each section, splitContent can be looped through:
from bs4 import BeautifulSoup  # the question's code refers to this class as bs

for section in splitContent:
    section['asSoup'] = BeautifulSoup(section['HTML'], 'html.parser')
Now, you can see any of the sections in any of the formats added to the dictionaries.
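Since the question was about links, here is one possible way to pull them out of a single section once its HTML has been saved. This is only a sketch: the section dictionary below is a hypothetical entry shaped like an item of splitContent, section_links is an illustrative helper, and skipping "#" anchors and resolving relative URLs are choices you may not want.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://en.wikipedia.org/wiki/Barack_Obama'

# Hypothetical entry shaped like one item of splitContent.
section = {
    'section': 'Early life and career',
    'HTML': '<h2>Early life and career</h2>'
            '<p>Born in <a href="/wiki/Honolulu">Honolulu</a>'
            '<a href="#cite_note-1">[1]</a>, see '
            '<a href="https://example.org/bio">bio</a>.</p>',
}


def section_links(section_html, base_url):
    """Return absolute URLs for every link in one section's HTML."""
    soup = BeautifulSoup(section_html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if href.startswith('#'):  # skip in-page anchors such as footnote markers
            continue
        links.append(urljoin(base_url, href))  # resolve relative hrefs
    return links


links = section_links(section['HTML'], BASE_URL)
print(links)
# ['https://en.wikipedia.org/wiki/Honolulu', 'https://example.org/bio']
```

Running this over the 'Abstract' and 'Early life and career' entries separately would give the two link lists the question asks for.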
Note that listSoups is not the same as asSoup. listSoups is a list, and each item in it is still connected to the original soup variable, so you can view its parent, nextSibling, etc. in ways that would not be possible with asSoup, which is a single object.
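The difference is easy to check on a tiny document. In this sketch (the HTML string is made up), original_p plays the role of an item in listSoups and reparsed plays the role of asSoup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="outer"><h2>Title</h2><p>Body</p></div>', 'html.parser')

original_p = soup.find('p')  # still attached to the original tree
reparsed = BeautifulSoup(str(original_p), 'html.parser')  # fresh, detached tree
reparsed_p = reparsed.find('p')

print(original_p.parent.get('id'))       # outer
print(original_p.previous_sibling.name)  # h2
print(reparsed_p.parent.name)            # [document]
print(reparsed_p.previous_sibling)       # None
```

The re-parsed tag only knows about the new document built from its own HTML, so its surroundings from the article are gone.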
*Btw, using {'class':'mw-headline'} will give you not just the main section headers, but also the subheaders. You can actually get something like a tree of the article structure with:
for h in soup.findAll('span', attrs={'class': 'mw-headline'}):
    hLevel = int(h.parent.name.replace('h', ''))
    print(('\t'*(hLevel-2))+'↳', f'[{h.parent.name}] {h.text}')
Additional EDIT:
To get a dictionary of section texts, just use
sectnTexts_dict = dict([(
    sc['section'].replace(' ', '_'),  # section name to key
    sc['asText']  # section text as value
) for sc in splitContent])
To view a truncated version, print dict((k, v[:50]+'...') for k, v in sectnTexts_dict.items()), which looks like
{
"Abstract": "44th President of the United States\n\"Barack\" and \"...",
"Early_life_and_career": "Early life and career\nMain article: Early life and...",
"Legal_career": "Legal career\nCivil Rights attorney\nHe joined Davis...",
"Legislative_career": "Legislative career\nIllinois Senate (1997\u20132004)\nMai...",
"Presidential_campaigns": "Presidential campaigns\n2008\nMain articles: 2008 Un...",
"Presidency_(2009\u20132017)": "Presidency (2009\u20132017)\n First official portrait of...",
"Cultural_and_political_image": "Cultural and political image\nMain article: Public ...",
"Post-presidency_(2017\u2013present)": "Post-presidency (2017\u2013present)\n Obama with his the...",
"Legacy": "Legacy\n Job growth during the presidency of Obama ...",
"Bibliography": "Bibliography\nMain article: Bibliography of Barack ...",
"See_also": "See also\n\n\nBiography portal\nUnited States portal\nC...",
"References": "References\n\n^ \"Barack Hussein Obama Takes The Oath...",
"Further_reading": "Further reading\n\nDe Zutter, Hank (December 8, 1995...",
"External_links": "External links\nLibrary resources about Barack Oba..."
}
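The double quotes and \u2013 escapes in that output look like json.dumps formatting rather than a plain print of a dict, though that is a guess. A small sketch of producing such a preview, using a made-up two-entry sectn_texts dictionary in place of the real sectnTexts_dict:

```python
import json

# Hypothetical stand-in for sectnTexts_dict.
sectn_texts = {
    'Abstract': 'x' * 80,
    'Presidency_(2009\u20132017)': 'Presidency (2009\u20132017)\n First official portrait',
}

# Keep the first 50 characters of each section text and mark the cut.
preview = {k: v[:50] + '...' for k, v in sectn_texts.items()}

# ensure_ascii defaults to True, so the en dash is emitted as \u2013.
print(json.dumps(preview, indent=4))
```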
Answered By - Driftr95