Issue
I need to write a program in Python using bs4 that shows the path from one Wikipedia page to another. To do this, I have to follow the first link on the current Wikipedia page that is located in the first paragraph inside the 'div' with id=bodyContent.
However, there is a restriction: I have to take the first link that is not located between brackets, for example:
Epistemology (/ɪˌpɪstəˈmɒlədʒi/ (listen); from Ancient Greek ἐπιστήμη (epistḗmē) 'knowledge', and -logy) is the branch of philosophy concerned with knowledge. Epistemologists study the nature, origin, and scope of knowledge, epistemic justification, the rationality of belief, and various related issues. Epistemology is considered a major subfield of philosophy, along with other major subfields such as ethics, logic, and metaphysics.[1]
In this paragraph, Ancient Greek is a link, but I can't use that one because it's between brackets, so the link that I have to use is branch of philosophy (https://en.wikipedia.org/wiki/Epistemology).
My problem is that I don't know how I can find the first link that is not between brackets in the first paragraph. This is what I already have:
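```
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# `current`, `end` and `url` are defined elsewhere in the program (not shown)
while current != end:
    response = session.get(url)
    d = BeautifulSoup(response.content, 'html.parser')
    body = d.find('div', id="bodyContent")
    d = body.find("p")
    # skip paragraphs that contain no link at all
    while d.find('a') is None:
        d = d.findNext("p")
    d = d.find("a")
```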
Solution
One solution could be to get the first paragraph as text, replace ( and ) with custom tags, and parse it again. For example:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Epistemology"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
txt = (
    str(soup.select_one(".mw-parser-output p:has(a)"))
    .replace("(", "<bracket>")
    .replace(")", "</bracket>")
)
soup = BeautifulSoup(txt, "html.parser")
a = soup.find(lambda tag: tag.name == "a" and not tag.find_parent("bracket"))
print(a)
Prints:
<a href="/wiki/Outline_of_philosophy" title="Outline of philosophy">branch of philosophy</a>
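A different approach from the tag-replacement trick above is to walk the paragraph in document order and keep a running count of open parentheses in the text nodes, returning the first link seen while that count is zero. The sketch below is only an illustration under assumptions: the function name `first_link_outside_parens` and the sample HTML are made up for this example, and it has not been tried against live Wikipedia markup. Because it never rewrites the markup, parentheses inside attribute values such as an href are left untouched, whereas a plain string replacement changes those as well.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def first_link_outside_parens(paragraph):
    """Return the first <a> tag in `paragraph` that is not inside (...).

    Hypothetical helper: counts parentheses in text nodes only,
    so attribute values such as href are never modified.
    """
    depth = 0  # number of currently open parentheses
    for node in paragraph.descendants:  # document order
        if isinstance(node, NavigableString):
            for ch in node:
                if ch == "(":
                    depth += 1
                elif ch == ")" and depth > 0:
                    depth -= 1
        elif isinstance(node, Tag) and node.name == "a" and depth == 0:
            return node
    return None  # no link outside parentheses in this paragraph


# Made-up sample resembling the paragraph from the question
sample_html = (
    '<p><b>Epistemology</b> (from <a href="/wiki/Ancient_Greek">Ancient Greek</a> '
    '<i>epistḗmē</i>) is the <a href="/wiki/Outline_of_philosophy">branch of '
    'philosophy</a> concerned with knowledge.</p>'
)
paragraph = BeautifulSoup(sample_html, "html.parser").p
link = first_link_outside_parens(paragraph)
print(link["href"])  # /wiki/Outline_of_philosophy
```

If the function returns None for a paragraph, the caller can move on to the next paragraph, in the same way the loop in the question skips paragraphs without links.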
Answered By - Andrej Kesely