Issue
How to crawl the first link in the first paragraph of a Wikipedia page?
All links inside parentheses should be excluded. As an example, take the following page:
https://en.wikipedia.org/wiki/Data
On this page, the first link I want to crawl is "qualitative" (href="/wiki/Qualitative_property"). My code already excludes the special links such as footnotes and pronunciation, but it can't exclude the normal links inside the parentheses.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://en.wikipedia.org/wiki/Data')
html = response.text
soup = BeautifulSoup(html, "html.parser")
link = soup.find(id='mw-content-text').find(class_="mw-parser-output").find_all('p', recursive=False)
list_a = []
for element in link:
    if element.find("a", recursive=False):
        print(element.find("a", recursive=False).get('href'))
        break
Solution
Technically speaking, those links are no different from the links outside the parentheses. If you look closely at the href attributes of the parenthesized links on this page, though, all of them begin with /wiki/Help:, so you can skip any link whose href starts that way. The code below uses a regular expression to do that:
Code
import re
import requests
from bs4 import BeautifulSoup
response = requests.get('https://en.wikipedia.org/wiki/Data')
html = response.text
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find(id='mw-content-text').find(class_="mw-parser-output").find_all('p', recursive=False)
list_a = []
# Help links RegEx
help_link_regex = re.compile(r'^/wiki/Help:')
for p in paragraphs:
    # Only <a> tags that are direct children of the paragraph
    p_links = p.find_all("a", recursive=False)
    for link in p_links:
        href = link.get('href')
        # Leave them out if they match the previous RegEx (or have no href at all)
        if href and not help_link_regex.match(href):
            print(href)
            list_a.append(href)
            # Stop after the first non-help link of this paragraph
            break
Output
/wiki/Qualitative_property
/wiki/Information
/wiki/Measurement
/wiki/Data_(word)
/wiki/Information
/wiki/Knowledge
/wiki/Sign
/wiki/Marketing
/wiki/Analog_computer
/wiki/Johanna_Drucker
Note the first link in this list is the first link (outside parentheses) in the first paragraph: the link you wanted.
The previous code adds only the first non-help link of each paragraph to list_a. If you want to get them all, just remove the break:
Output (after removing the break)
/wiki/Qualitative_property
/wiki/Quantitative_data
/wiki/Variable_(research)
/wiki/Information
/wiki/Scientific_research
/wiki/Stock_price
/wiki/Crime_rate
/wiki/Unemployment_rate
/wiki/Literacy
/wiki/Homelessness
/wiki/Measurement
/wiki/Data_reporting
/wiki/Data_analysis
/wiki/Data_visualization
/wiki/Concept
/wiki/Information
/wiki/Knowledge
/wiki/Data_processing
/wiki/Number
/wiki/Character_(computing)
/wiki/Outlier
/wiki/Field_work
/wiki/In_situ
/wiki/Experimental_data
/wiki/Petroleum
/wiki/Digital_economy
/wiki/Data_(word)
/wiki/Mass_noun
/wiki/Information
/wiki/Knowledge
/wiki/Wisdom
/wiki/Shannon_entropy
/wiki/Knowledge
/wiki/Mount_Everest
/wiki/Altimeter
/wiki/Sign
/wiki/Marketing
/wiki/Social_services
/wiki/Truth
/wiki/Analog_computer
/wiki/Computer
/wiki/Alphabet
/wiki/Computer_program
/wiki/Lisp_(programming_language)
/wiki/Metadata
/wiki/Johanna_Drucker
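The Help: filter works for this page, but it relies on the parenthesized links happening to be help links. If you need to skip any link that sits inside parentheses, one option is to track the parenthesis depth of the paragraph text while parsing. Below is a minimal sketch of that idea. Note that it is a different technique from the code above: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without a network request, and the class name, function name and sample HTML are made up for illustration (the sample is not the real markup of the Data page).

```python
from html.parser import HTMLParser


class LinksOutsideParens(HTMLParser):
    """Collect hrefs of <a> tags that appear while no "(" is open in the text."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many "(" are currently open in the visible text
        self.hrefs = []   # hrefs of links found at depth 0

    def handle_starttag(self, tag, attrs):
        # Record a link only when we are not inside parentheses
        if tag == "a" and self.depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_data(self, data):
        # Only visible text is counted, so parentheses inside an href
        # such as /wiki/Data_(word) do not affect the depth
        for ch in data:
            if ch == "(":
                self.depth += 1
            elif ch == ")" and self.depth > 0:
                self.depth -= 1


def links_outside_parentheses(paragraph_html):
    """Return the hrefs of all links in the HTML that are outside parentheses."""
    parser = LinksOutsideParens()
    parser.feed(paragraph_html)
    parser.close()
    return parser.hrefs


# Hypothetical paragraph, loosely modelled on a Wikipedia lead sentence
sample = (
    '<p><b>Data</b> (<a href="/wiki/Help:IPA/English">/ˈdeɪtə/</a>; '
    'see <a href="/wiki/Latin">Latin</a>) are '
    '<a href="/wiki/Qualitative_property">qualitative</a> or '
    '<a href="/wiki/Quantitative_data">quantitative</a> values.</p>'
)
print(links_outside_parentheses(sample))
# ['/wiki/Qualitative_property', '/wiki/Quantitative_data']
```

To use this on the real page you would feed it the HTML of each paragraph, for example str(p) for each p in paragraphs. Unlike the recursive=False lookup above, this sketch does not skip links nested in other tags, so footnote links would still need to be filtered separately.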
I hope this helps you, otherwise, let me know what went wrong.
Answered By - JoshuaCS