Monday, December 6, 2021

[FIXED] BeautifulSoup How to remove tags whose text has specific value

Labels: beautifulsoup, python

Issue

I'm attempting to scrape some articles from Wikipedia and have found some entries I wish to exclude.

In the case below I want to exclude the two a tags whose content equals either Archived or Wayback Machine. The text does not have to be the deciding factor: the href value would also work, excluding URLs that contain archive.org or /wiki/Wayback_Machine.

<li id="cite_note-22">
    <span class="mw-cite-backlink">
        <b>
            <a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
        </b>
    </span> 
    <span class="reference-text">
        <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a> 
        <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
        <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
    </span>
</li>

I've attempted to use decompose as below, but this raises the error 'str' object has no attribute 'descendants':

removeWayback = BeautifulSoup.find_all('a', {'title':'Wayback Machine'})
removeArchive = BeautifulSoup.find(text="Archive")
removeWayback.decompose()
removeArchive.decompose()

removeWayback = BeautifulSoup.find_all('a', {'title':'Wayback Machine'})
  File "/usr/local/lib/python3.8/site-packages/bs4/element.py", line 1780, in find_all
    generator = self.descendants
AttributeError: 'str' object has no attribute 'descendants'

I've also attempted to use exclude but I have similar issues.

Is there a better way to ignore these links?

Solution

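You could pass a filter function to find_all that keeps only the a tags whose text does not match any of the unwanted patterns (the \^ alternative also drops the "Jump up" backlink):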

import re
from bs4 import BeautifulSoup

html = """<li id="cite_note-22">
    <span class="mw-cite-backlink">
        <b>
            <a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
        </b>
    </span> 
    <span class="reference-text">
        <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a> 
        <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
        <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
    </span>
</li>"""

soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all(lambda t: t.name == 'a' and not re.search(r'Wayback|Archived|\^', t.text)):
    print(f"{anchor.text} - {anchor.get('href')}")

Output:

Article Text I want to keep - https://www.somelink.com
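
The question also mentions that the href value could serve as the exclusion criterion. Here is a minimal sketch of that variant. Note that the sample HTML above uses a placeholder URL for the Archived link, so the archive.org URL, the UNWANTED_HREF pattern and the is_wanted helper below are assumptions made for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the archive.org URL is an assumption for illustration.
html = """<span class="reference-text">
    <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
    <a rel="nofollow" class="external text" href="https://web.archive.org/web/2021/https://www.somelink.com">Archived</a>
    <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

# URL fragments to exclude, as described in the question.
UNWANTED_HREF = re.compile(r"archive\.org|/wiki/Wayback_Machine")


def is_wanted(tag):
    """Return True for <a> tags whose href does not match an unwanted pattern."""
    if tag.name != "a":
        return False
    return not UNWANTED_HREF.search(tag.get("href", ""))


soup = BeautifulSoup(html, "html.parser")
kept = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all(is_wanted)]
for text, href in kept:
    print(f"{text} - {href}")
```

Matching on href is likely to be more robust than matching on link text when the visible label varies between citations, provided the archived links really do point at archive.org.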

EDIT to answer the comment:

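To match by class and text, pass attrs= to .find_all() and move the regex condition into the loop. Note that a class value containing a space, such as "external text", is compared against the exact string of the class attribute.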

soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a", attrs={"class": "external text"}):
    if not re.search(r'Wayback|Archived', anchor.text):
        print(f"{anchor.text} - {anchor.get('href')}")

Output:

Article Text I want to keep - https://www.somelink.com
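
Both snippets above skip the unwanted links while printing. If the goal is to remove them from the parsed tree, decompose() can still be used. The error in the question appears to come from calling find_all on the BeautifulSoup class instead of on a parsed instance, so the string 'a' ends up where the document object is expected. Two further points would likely have caused trouble afterwards: find_all returns a list-like ResultSet, so decompose() has to be called on each tag, and find(text="Archive") looks for that exact string, which does not match "Archived". A minimal sketch under those assumptions (the unwanted_text and remaining names are illustrative):

```python
import re
from bs4 import BeautifulSoup

html = """<span class="reference-text">
    <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
    <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
    <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

# Parse first: find_all must be called on the parsed soup (an instance),
# not on the BeautifulSoup class itself.
soup = BeautifulSoup(html, "html.parser")

unwanted_text = re.compile(r"Wayback|Archived")

# find_all returns a list-like ResultSet, so decompose() each matching tag in turn.
for anchor in soup.find_all("a"):
    if unwanted_text.search(anchor.get_text()):
        anchor.decompose()

# Collect the text of the anchors that are still in the tree.
remaining = [a.get_text(strip=True) for a in soup.find_all("a")]
print(remaining)
```

After the loop, the soup object no longer contains the two unwanted anchors, so any later processing of soup only sees the link that should be kept.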

Answered By - baduker

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
