Issue
So I'm trying to exclude (not extract) the info contained in a span. Here's the HTML:
<li><span>Type:</span> Cardiac Ultrasound</li>
And here's my code:
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    description_elements = description_el.find('span')
    for el in description_elements:
        curr_el = {}
        key = el.replace(':', '')
        print(el)
    print(description_el.text.replace(' ', ''))
Where listing_soup is basically the whole page (in my example, the HTML above). When I do that I get:
Type:
Type: CardiacUltrasound
As you can see, for some extraordinary reason :P, the span isn't affected by my replace() method, even though .text yields a str.
EDIT: Sorry. My objective is to create a bunch of dictionaries where the key is the span and the value is what comes after it.
Solution
NOTE: Be careful about "creating a bunch of dictionaries", as a dictionary can't have duplicate keys. You could instead build a list of dictionaries, in which case duplicate keys across items won't matter (they still matter within each individual dictionary).
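To make the note above concrete, here is a minimal sketch of what happens with duplicate keys. The pairs are hypothetical sample data, not taken from the page in the question; they only stand in for whatever the loop over the <li> elements would produce.

```python
# Hypothetical (key, value) pairs, standing in for what the <li> loop yields.
pairs = [
    ("Type", "Cardiac Ultrasound"),
    ("Type", "Vascular Ultrasound"),
    ("Brand", "Acme"),
]

# One shared dictionary: the second "Type" silently overwrites the first.
single = {}
for k, v in pairs:
    single[k] = v

# A list of one-entry dictionaries keeps every pair, duplicates included.
many = [{k: v} for k, v in pairs]

print(single)
print(many)
```

Which shape is right depends on whether each <li> is its own record or all of them describe one item.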
Option 1:
Use .next_sibling (it is an attribute, not a method call)
from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    # Key: the <span> text without the trailing colon
    k = description_el.find('span').text.replace(':', '')
    # Value: the text node that directly follows the <span>
    v = description_el.find('span').next_sibling.strip()
    print(k)
    print(v)
Option 2:
Just get the text from description_el, then .split(':'). That gives you the two elements you want (if I'm reading your question correctly).
from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    # Split on the first colon only, so colons inside the value are kept
    descText = description_el.text.split(':', 1)
    k = descText[0].strip()
    v = descText[-1].strip()
    print(k)
    print(v)
Option 3:
Get the <span> text, remove the <span>, then get the remaining text in the <li>. Note that .extract() removes the <span> from the soup, so since you don't want to extract, this might not be useful to you.
from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')
item_description_infos = listing_soup.find(class_='item_description').find_all('li')
for description_el in item_description_infos:
    k = description_el.find('span').text.replace(':', '')
    # Remove the <span> from the tree; only the value text is left in the <li>
    description_el.find('span').extract()
    v = description_el.text.strip()
    print(k)
    print(v)
Output:
Type
Cardiac Ultrasound
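Since the stated goal is a bunch of dictionaries, here is a rough sketch that collects the pairs from Option 1 into a list of dictionaries. The names records, key and value are my own, and it assumes the value is always a plain text node directly after the <span>. As far as the posted snippet shows, the original attempt printed el rather than key; str.replace() returns a new string and leaves the original untouched, which would explain why the colon never disappeared.

```python
from bs4 import BeautifulSoup

html = '''
<div class="item_description">
<li><span>Type:</span> Cardiac Ultrasound</li></div>'''

listing_soup = BeautifulSoup(html, 'html.parser')

records = []  # hypothetical name: one small dict per <li>
for li in listing_soup.find(class_='item_description').find_all('li'):
    span = li.find('span')
    if span is None:
        continue  # skip items without a label
    # replace() returns a new string, so keep the result
    key = span.get_text().replace(':', '').strip()
    # Assumption: the value is the text node right after the <span>
    sibling = span.next_sibling
    value = str(sibling).strip() if sibling is not None else ''
    records.append({key: value})

print(records)  # [{'Type': 'Cardiac Ultrasound'}]
```

If all the <li> elements describe a single item and the labels are unique, collecting them into one dictionary instead of a list would work just as well.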
Answered By - chitown88