Wednesday, April 6, 2022

[FIXED] bs4 `next_sibling` VS `find_next_sibling`

Labels: beautifulsoup, python, python-3.x, web-scraping

Issue

I am struggling with the usage of next_sibling (and similarly next_element). Used as attributes, they don't give me anything back, but the method counterparts find_next_sibling (or find_next) work. From the docs:

find_next_sibling: "Iterate over the rest of an element’s siblings in the tree. [...] returns the first one (of the match)"
find_next: "These methods use .next_elements to iterate over [...] and returns the first one"

So find_next_sibling depends on next_siblings. What does next_sibling depend on, and why does it return nothing?

from bs4 import BeautifulSoup

html = """
<div class="......">
 <div class="one-ad-desc">
  <div class="one-ad-title">
   <a class="one-ad-link" href="www this is the URL!">
    <h5>
     Text needed
    </h5>
   </a>
  </div>
  <div class="one-ad-desc">
    ...and some more needed text here!
  </div>
 </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')

for div in soup.find_all('div', class_="one-ad-title"):
    print('-> ', div.next_element)
    print('-> ', div.next_sibling)
    print('-> ', div.find_next_sibling())
    break

Output

->  

->  

->  <div class="one-ad-desc">
    ...and some more needed text here!
  </div>

Solution

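The main point, in my opinion, is that .next_element and .next_sibling return the very next node in the parse tree, whatever its type, while .find_next_sibling() and .find_next() skip ahead to the next tag.

In your HTML the node that directly follows the selected div is the whitespace (line break and indentation) between the tags, which Beautiful Soup keeps as a text node. So the attributes do return something; it just prints as blank.

Print the name of each result and you will see that the next element and the next sibling are not tags: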

for div in soup.find_all('div', class_="one-ad-title"):
    print('-> ', div.next_element.name)
    print('-> ', div.next_sibling.name)
    print('-> ', div.find_next_sibling().name)

#output
->  None
->  None
->  div
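
To make the difference visible, here is a minimal sketch that prints the type of each result instead of its name. It assumes a trimmed-down copy of the markup from the question and uses html.parser rather than lxml, only to avoid the extra dependency; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Trimmed-down copy of the markup, keeping the line breaks between tags.
html = """<div class="one-ad-desc">
  <div class="one-ad-title"><a class="one-ad-link" href="www this is the URL!"><h5>Text needed</h5></a></div>
  <div class="one-ad-desc">...and some more needed text here!</div>
</div>"""

# Assumption: html.parser is swapped in for lxml so nothing extra is required.
soup = BeautifulSoup(html, "html.parser")
title = soup.find("div", class_="one-ad-title")

sibling = title.next_sibling        # the very next node, whatever its type
found = title.find_next_sibling()   # the first following sibling that is a Tag

print(type(sibling).__name__, repr(str(sibling)))  # NavigableString '\n  '
print(type(found).__name__, found.name)            # Tag div
```

The whitespace string is a real node in the tree, which is why next_sibling appears to return nothing when printed.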

If you put your input on one line, with no whitespace between the tags, you get the following result:

from bs4 import BeautifulSoup

html = """
<div class="......"><div class="one-ad-desc"><div class="one-ad-title"><a class="one-ad-link" href="www this is the URL!"><h5>Text needed</h5></a></div><div class="one-ad-desc">...and some more needed text here!</div></div></div>"""

soup = BeautifulSoup(html, 'lxml')

for div in soup.find_all('div', class_="one-ad-title"):
    print('-> ', div.next_element)
    print('-> ', div.next_sibling)
    print('-> ', div.find_next_sibling())

Output:

->  <a class="one-ad-link" href="www this is the URL!"><h5>Text needed</h5></a>
->  <div class="one-ad-desc">...and some more needed text here!</div>
->  <div class="one-ad-desc">...and some more needed text here!</div>
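
If rewriting the input is not an option, the same result can probably be had by walking .next_siblings and keeping only tags, which is roughly what the quoted documentation describes for find_next_sibling. The sketch below is one way to do that; next_tag_sibling is a hypothetical helper written for this example, not part of bs4, and html.parser is again an assumption made to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, Tag


def next_tag_sibling(node):
    """Hypothetical helper: return the first following sibling that is a Tag, else None."""
    for sibling in node.next_siblings:
        if isinstance(sibling, Tag):  # skip whitespace and other text nodes
            return sibling
    return None


# Multi-line markup, so whitespace text nodes sit between the tags.
html = """<div class="one-ad-desc">
  <div class="one-ad-title">
    <a class="one-ad-link" href="www this is the URL!"><h5>Text needed</h5></a>
  </div>
  <div class="one-ad-desc">
    ...and some more needed text here!
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("div", class_="one-ad-title")

manual = next_tag_sibling(title)
builtin = title.find_next_sibling()

# Both approaches land on the same <div class="one-ad-desc"> node.
print(manual is builtin)
```

In practice find_next_sibling() already does this filtering, so the helper is only useful for seeing what happens under the hood.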

Note that "Text needed" is not in a sibling of your selected tag; it is inside one of its children. To select it, use print('-> ', div.find_next().text), where find_next() returns the <a> tag that wraps the <h5>.
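
With the original multi-line markup, .text also carries the surrounding line breaks and indentation. A possible way to get both pieces of text cleanly is get_text(strip=True), sketched below; looking up the <h5> with find() and using html.parser instead of lxml are choices made for this example, not something taken from the answer.

```python
from bs4 import BeautifulSoup

# Multi-line markup as in the question (outer wrapper omitted for brevity).
html = """<div class="one-ad-desc">
  <div class="one-ad-title">
   <a class="one-ad-link" href="www this is the URL!">
    <h5>
     Text needed
    </h5>
   </a>
  </div>
  <div class="one-ad-desc">
    ...and some more needed text here!
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("div", class_="one-ad-title")

# The heading is a descendant of the selected div, so search inside it.
heading_text = title.find("h5").get_text(strip=True)

# The description is the next sibling tag; strip the surrounding whitespace.
desc_text = title.find_next_sibling("div", class_="one-ad-desc").get_text(strip=True)

print(heading_text)  # Text needed
print(desc_text)     # ...and some more needed text here!
```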

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
