Issue
I'm having trouble counting all the H-tags in an article. I need to restrict the search to the main <article> element of the page. The markup looks something like this:
<article class="Article-p6ncbx-0 hxYamq">
  <div class ="">
    <div class ="">
      <div class ="">
        <div class ="">
          <div class ="">
            <div class ="">
              <h3>I need to search this one</h3>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</article>
<footer class="Footer-238971asd sdjkYsd">
  <div class ="">
    <div class ="">
      <div class ="">
        <div class ="">
          <h3>But I dont want to find this H3-tag</h3>
Running this code prints every H1 to H4 tag on the page, including those in the header and footer, both of which are outside the article element:
for heading in soup.find_all(["h1", "h2", "h3", "h4"]):
    print(heading.name + ' ' + heading.text.strip())
I'm new to this and have a hard time understanding how to keep the search inside the article element. Any help would be much appreciated.
I understand this topic has been covered at length before, but I can't find a solution to this specific issue, where the search has to stay inside one element. Feel free to correct me if this could have been solved by a simple search.
Here is a screenshot of how the entire thing looks. Here is the actual page also.
Solution
To count or print only the headings inside the article, first select all <article> elements from soup, then call find_all() on each selection to get its headings:
import requests
from bs4 import BeautifulSoup

result = requests.get('https://www.prisjakt.nu/sa-valjer-du-ratt-grill--ecXqqVohAAACIARDWF')
soup = BeautifulSoup(result.content, 'lxml')

# Limit the search to <article> elements, so header and footer headings are skipped
for article in soup.select('article'):
    # find_all() on a tag only searches that tag's descendants
    for heading in article.find_all(['h1', 'h2', 'h3', 'h4']):
        print(heading.name + ' ' + heading.text.strip())
Output:
h1 Så väljer du rätt grill
h4 Kolgrill, gasolgrill, elgrill – vad ska man egentligen välja? Här får du tipsen och råden du behöver innan du väljer!
h3 Kolgrillen
h4 Fördelar
h4 Nackdelar
h4 3 populäraste kolgrillarna våren 2021
h3 Grillkol eller briketter?
h3 Gasolgrillen
h4 3 populäraste gasolgrillarna våren 2021
h3 Elgrillen
h4 3 populäraste elgrillarna våren 2021
h3 Prisjakts grilltips
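Since the question asks for a count and not just a printout, here is a minimal offline sketch of the same scoping idea. The inline HTML is a made-up reduction of the markup in the question (the header and the h4 are invented for illustration), html.parser is used so no lxml install is needed, and collections.Counter is just one way to tally the tag names. It also shows a CSS selector that should give the same result in a single call.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page
html = """
<header><h1>Site title</h1></header>
<article class="Article-p6ncbx-0 hxYamq">
  <div><h3>I need to search this one</h3></div>
  <div><h4>Nested heading</h4></div>
</article>
<footer class="Footer-238971asd sdjkYsd">
  <div><h3>But I dont want to find this H3-tag</h3></div>
</footer>
"""

soup = BeautifulSoup(html, "html.parser")

# Scope the search: find_all() on the article only sees its descendants
article = soup.find("article")
headings = article.find_all(["h1", "h2", "h3", "h4"])

# Tally how many headings of each level the article contains
heading_counts = Counter(heading.name for heading in headings)
print(dict(heading_counts))

# Alternative: one CSS selector that only matches headings inside <article>
css_headings = soup.select("article h1, article h2, article h3, article h4")
print(len(css_headings))
```

The header's h1 and the footer's h3 are left out of both results, because neither is a descendant of the article element.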
Instead of text.strip() you can also use get_text(strip=True).
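The two calls are interchangeable for plain headings, but they can differ when a heading contains nested tags. The sketch below uses an invented h3 with a span inside it to show the difference: text.strip() trims only the ends of the combined text, while get_text(strip=True) strips every text fragment and joins them with no separator unless one is passed.

```python
from bs4 import BeautifulSoup

# Hypothetical heading with a nested tag and extra whitespace
soup = BeautifulSoup("<h3> Kolgrillen <span> 2021 </span></h3>", "html.parser")
heading = soup.h3

# Strips only the outer whitespace of the concatenated text
outer_stripped = heading.text.strip()

# Strips each text fragment, then joins them with an empty string
each_stripped = heading.get_text(strip=True)

# Same, but joins the stripped fragments with a single space
space_joined = heading.get_text(" ", strip=True)

print(repr(outer_stripped))
print(repr(each_stripped))
print(repr(space_joined))
```

For headings that may wrap links or spans, passing a separator as in the last call is usually the safer choice.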
Answered By - HedgeHog