Issue
I'm attempting to scrape some articles from Wikipedia and have found some entries I wish to exclude.
In the case below, I want to exclude the two a tags whose content equals either Archived or Wayback Machine. The text doesn't have to be the deciding factor: the href value would work as well, excluding on the URL archive.org or /wiki/Wayback_Machine.
<li id="cite_note-22">
<span class="mw-cite-backlink">
<b>
<a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
</b>
</span>
<span class="reference-text">
<a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
<a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
<a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>
</li>
I've attempted to use decompose as below, but this raises the error 'str' object has no attribute 'descendants':
removeWayback = BeautifulSoup.find_all('a', {'title':'Wayback Machine'})
removeArchive = BeautifulSoup.find(text="Archive")
removeWayback.decompose()
removeArchive.decompose()
    removeWayback = BeautifulSoup.find_all('a', {'title':'Wayback Machine'})
  File "/usr/local/lib/python3.8/site-packages/bs4/element.py", line 1780, in find_all
    generator = self.descendants
AttributeError: 'str' object has no attribute 'descendants'
I've also attempted to use exclude, but I ran into similar issues.
Is there a better way to ignore these links?
Solution
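You could try this: pass .find_all() a function that keeps only the a tags whose text doesn't match the unwanted labels (the regex also drops the ^ backlink):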
import re
from bs4 import BeautifulSoup
html = """<li id="cite_note-22">
<span class="mw-cite-backlink">
<b>
<a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
</b>
</span>
<span class="reference-text">
<a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
<a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
<a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>
</li>"""
soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all(lambda t: t.name == 'a' and not re.search(r'Wayback|Archived|\^', t.text)):
    print(f"{anchor.text} - {anchor.get('href')}")
Output:
Article Text I want to keep - https://www.somelink.com
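As a side note on the decompose attempt in the question: the AttributeError appears because find_all is called on the BeautifulSoup class itself rather than on a parsed instance, so the string 'a' ends up bound as self. In addition, find_all returns a list-like ResultSet, which has no decompose method of its own. A minimal sketch of the decompose route, assuming a trimmed copy of the question's markup and an illustrative set named unwanted_text (not part of the original answer):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (only the reference-text span).
html = """<span class="reference-text">
<a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
<a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
<a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

# Parse first: find_all must be called on an instance, not on the BeautifulSoup class.
soup = BeautifulSoup(html, "html.parser")

# Illustrative name: the link texts that should be dropped.
unwanted_text = {"Archived", "Wayback Machine"}

# find_all returns a list-like ResultSet, so decompose each matching tag individually.
for anchor in soup.find_all("a"):
    if anchor.get_text(strip=True) in unwanted_text:
        anchor.decompose()

# Whatever is left in the tree after the unwanted anchors are removed.
remaining = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
print(remaining)
```

Unlike the filtering approach, this modifies the parsed tree in place, which may be preferable if the cleaned markup is needed later on.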
EDIT to answer the comment:
You'd match by class and text by using the attrs= argument of .find_all() and moving the regex condition into the loop.
soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a", attrs={"class": "external text"}):
    if not re.search(r'Wayback|Archived', anchor.text):
        print(f"{anchor.text} - {anchor.get('href')}")
Output:
Article Text I want to keep - https://www.somelink.com
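The question also mentions that the href value could serve as the exclusion criterion. A rough sketch of that variant follows. It assumes the archived link actually points at archive.org (the sample markup uses a placeholder domain, so the web.archive.org URL below is hypothetical), and the names excluded_fragments and is_wanted are illustrative only:

```python
from bs4 import BeautifulSoup

# Assumption: the "Archived" link points at archive.org, as the question suggests.
# The archive URL here is a hypothetical stand-in for the placeholder in the sample.
html = """<span class="reference-text">
<a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
<a rel="nofollow" class="external text" href="https://web.archive.org/web/2020/https://www.somelink.com">Archived</a>
<a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# Illustrative name: substrings that mark an href as unwanted.
excluded_fragments = ("archive.org", "/wiki/Wayback_Machine")

def is_wanted(tag):
    """Keep only <a> tags whose href contains none of the excluded fragments."""
    if tag.name != "a":
        return False
    href = tag.get("href", "")
    return not any(fragment in href for fragment in excluded_fragments)

kept = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all(is_wanted)]
print(kept)
```

Matching on href tends to be less sensitive to changes in the link text, though it depends on the archive links keeping a recognisable URL pattern.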
Answered By - baduker