Issue
I am trying to extract data from the following HTML and organize the extracted output:
html_doc = """
<html>
<body>
<ul class="unordered-list">
<li class="menu-category">
<div class="h4 category-name section-title">Birds Toys</div>
<div class="category-description">Toys belonging to the Bird Category</div>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Eagle</span>
<span class="item-price">$40.00</span>
</div>
<p class="description">Eagle is the national bird of the US.</p>
</li>
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Parrot</span>
<span class="item-price">$14.00</span>
</div>
<p class="description">Parrot is found in tropical and subtropical region.</p>
</li>
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Owls</span>
<span class="item-price">$23.00</span>
</div>
<p class="description">Owls are nocturnal.</p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Kingfisher</span>
<span class="item-price">$13.00</span>
</div>
<p class="description">Kigfisher hunts in the water</p>
</li>
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Quail</span>
<span class="item-price">$22.00</span>
</div>
<p class="description"></p>
</li>
</ul>
</li>
</ul>
<ul class="unordered-list">
<li class="menu-category">
<div class="h4 category-name section-title">Reptiles Toys</div>
<div class="category-description">Toys belonging to Reptiles Category</div>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Snake</span>
<span class="item-price">$7.00</span>
</div>
<p class="description">Snakes can be poisonous.</p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Lizard</span>
<span class="item-price">$7.00</span>
</div>
<p class="description">Lizards are found both at homes and in jungle</p>
</li>
</ul>
</li>
</ul>
<ul class="unordered-list">
<li class="menu-category">
<div class="h4 category-name section-title">Germs Toys</div>
<div class="category-description">Toys that belong to germs category</div>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Bacteria</span>
<span class="item-price">$12.95</span>
</div>
<p class="description">Bacteria can cause tuberclausis</p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Protozoa</span>
<span class="item-price">$11.95</span>
</div>
<p class="description"></p>
</li>
</ul>
<ul class="menu-items-list unordered-list">
<li class="menu-items">
<div class="h6 item-main">
<span class="item-title">Virus</span>
<span class="item-price">$12.95</span>
</div>
<p class="description">Viruses are known to cause Corona, Aids, etc.</p>
</li>
</ul>
</li>
</ul>
</body>
</html>
"""
I am able to successfully extract the div-class, span-class, p-class combinations using the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
with open("output.txt", "w") as output:
    # CATEGORY NAME: find a list of all category div elements
    divitemscatg = soup.find_all('div', {'class' : 'h4 category-name section-title'})
    linesdivitemscatg = [div.get_text() for div in divitemscatg]
    print(linesdivitemscatg)
    # ITEM TITLE: find a list of all item-title span elements
    spansitemtitle = soup.find_all('span', {'class' : 'item-title'})
    linesitemtitle = [span.get_text() for span in spansitemtitle]
    print(linesitemtitle)
    # ITEM PRICE: find a list of all item-price span elements
    spansitemprice = soup.find_all('span', {'class' : 'item-price'})
    linesitemprice = [span.get_text() for span in spansitemprice]
    print(linesitemprice)
    # DESC: find a list of all description p elements
    spansitemdesc = soup.find_all('p', {'class' : 'description'})
    linesitemdesc = [p.get_text() for p in spansitemdesc]
    print(linesitemdesc)
The output I am getting is:
['Birds Toys', 'Reptiles Toys', 'Germs Toys']
['Eagle', 'Parrot', 'Owls', 'Kingfisher', 'Quail', 'Snake', 'Lizard', 'Bacteria', 'Protozoa', 'Virus']
['$40.00', '$14.00', '$23.00', '$13.00', '$22.00', '$7.00', '$7.00', '$12.95', '$11.95', '$12.95']
['Eagle is the national bird of the US.', 'Parrot is found in tropical and subtropical region.', 'Owls are nocturnal.', 'Kigfisher hunts in the water', '', 'Snakes can be poisonous.', 'Lizards are found both at homes and in jungle', 'Bacteria can cause tuberclausis', '', 'Viruses are known to cause Corona, Aids, etc.']
But I need the output organized differently, with one line per item, as follows:
Birds Toys|Eagle|$40.00|Eagle is the national bird of the US.
Birds Toys|Parrot|$14.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|$23.00|Owls are nocturnal.
Birds Toys|Kingfisher|$13.00|Kigfisher hunts in the water
Birds Toys|Quail|$22.00|
Reptiles Toys|Snake|$7.00|Snakes can be poisonous.
Reptiles Toys|Lizard|$7.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|$12.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|$11.95|
Germs Toys|Virus|$12.95|Viruses are known to cause Corona, Aids, etc.
What changes are needed in the code above to achieve this? I am unable to arrange the output in the desired format.
Thanks in advance.
Solution
You could achieve your goal this way: select each element with the class menu-items, find the category div that precedes it, and prepend the category name to the item's own fields:
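from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
with open("output.txt", "w") as output:
    # One output line per menu item
    for item in soup.select('.menu-items'):
        data = [
            # Closest preceding category heading in document order
            item.find_previous('div', {'class': 'h4'}).text,
            item.select_one('.item-title').text,
            item.select_one('.item-price').text,
            item.select_one('.description').text
        ]
        output.write('|'.join(data) + '\n')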
Output
Birds Toys|Eagle|$40.00|Eagle is the national bird of the US.
Birds Toys|Parrot|$14.00|Parrot is found in tropical and subtropical region.
Birds Toys|Owls|$23.00|Owls are nocturnal.
Birds Toys|Kingfisher|$13.00|Kigfisher hunts in the water
Birds Toys|Quail|$22.00|
Reptiles Toys|Snake|$7.00|Snakes can be poisonous.
Reptiles Toys|Lizard|$7.00|Lizards are found both at homes and in jungle
Germs Toys|Bacteria|$12.95|Bacteria can cause tuberclausis
Germs Toys|Protozoa|$11.95|
Germs Toys|Virus|$12.95|Viruses are known to cause Corona, Aids, etc.
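As a variation on the approach above, the category-to-item relationship can also be made explicit by looping over each category and then over the items nested inside it. The following is only a sketch under a few assumptions: sample_html is a trimmed-down stand-in for html_doc that uses the same class names, and text_of and extract_rows are hypothetical helper names, not part of BeautifulSoup. The helper returns an empty string when an element is missing, so an item without a description does not raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for html_doc (assumption: same class names as the question).
sample_html = """
<ul class="unordered-list">
  <li class="menu-category">
    <div class="h4 category-name section-title">Birds Toys</div>
    <ul class="menu-items-list unordered-list">
      <li class="menu-items">
        <div class="h6 item-main">
          <span class="item-title">Eagle</span>
          <span class="item-price">$40.00</span>
        </div>
        <p class="description">Eagle is the national bird of the US.</p>
      </li>
      <li class="menu-items">
        <div class="h6 item-main">
          <span class="item-title">Quail</span>
          <span class="item-price">$22.00</span>
        </div>
        <p class="description"></p>
      </li>
    </ul>
  </li>
</ul>
<ul class="unordered-list">
  <li class="menu-category">
    <div class="h4 category-name section-title">Reptiles Toys</div>
    <ul class="menu-items-list unordered-list">
      <li class="menu-items">
        <div class="h6 item-main">
          <span class="item-title">Snake</span>
          <span class="item-price">$7.00</span>
        </div>
      </li>
    </ul>
  </li>
</ul>
"""


def text_of(parent, selector):
    """Hypothetical helper: stripped text of the first match, or '' if none."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else ""


def extract_rows(html):
    """Hypothetical helper: one [category, title, price, description] row per item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Outer loop: each category block; inner loop: only the items nested inside it.
    for category in soup.select("li.menu-category"):
        name = text_of(category, ".category-name")
        for item in category.select("li.menu-items"):
            rows.append([
                name,
                text_of(item, ".item-title"),
                text_of(item, ".item-price"),
                text_of(item, ".description"),
            ])
    return rows


rows = extract_rows(sample_html)
for row in rows:
    print("|".join(row))
```

Compared with find_previous, this version relies on the items being nested inside their category element, which holds for the HTML in the question. To write a file instead of printing, each joined row can be passed to output.write as in the answer above.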
Answered By - HedgeHog