Monday, December 12, 2022

[FIXED] Crawl the first paragraph link in wiki

Labels: beautifulsoup, python-3.x, python-requests

Issue

How can I crawl the first link in the first paragraph of a Wikipedia article?

All links inside parentheses should be excluded. As an example, I provide the following link:

https://en.wikipedia.org/wiki/Data

On this page, the first link I want to crawl is "qualitative" (href="/wiki/Qualitative_property"). My code already excludes the special links such as footnotes and pronunciation, but it can't exclude the ordinary links inside the parentheses.

import requests
from bs4 import BeautifulSoup
response = requests.get('https://en.wikipedia.org/wiki/Data')
html = response.text
soup = BeautifulSoup(html, "html.parser")
link = soup.find(id='mw-content-text').find(class_="mw-parser-output").find_all('p', recursive=False)
list_a = []
for element in link:
    if element.find("a", recursive=False):
        print(element.find("a", recursive=False).get('href'))
        break

Solution

Well, technically speaking, those links are no different from the links outside the parentheses. If you look closely at the href attribute of the links inside the parentheses, though, all of them begin with /wiki/Help:, so you can skip any link whose href starts with that prefix. In the code below I used a regular expression to do that:

Code

import re
import requests
from bs4 import BeautifulSoup

response = requests.get('https://en.wikipedia.org/wiki/Data')
html = response.text
soup = BeautifulSoup(html, "html.parser")

# Top-level paragraphs of the article body
paragraphs = soup.find(id='mw-content-text').find(class_="mw-parser-output").find_all('p', recursive=False)
list_a = []

# Help links RegEx
help_link_regex = re.compile(r'^/wiki/Help:')

for p in paragraphs:
    # Direct-child <a> tags only; href=True skips anchors without an href,
    # which would otherwise pass None to the RegEx and raise a TypeError
    p_links = p.find_all("a", href=True, recursive=False)

    for link in p_links:
        href = link['href']
        # Leave the link out if it matches the Help RegEx
        if not help_link_regex.match(href):
            print(href)
            list_a.append(href)
            # Keep only the first non-help link of this paragraph
            break

Output

/wiki/Qualitative_property
/wiki/Information
/wiki/Measurement
/wiki/Data_(word)
/wiki/Information
/wiki/Knowledge
/wiki/Sign
/wiki/Marketing
/wiki/Analog_computer
/wiki/Johanna_Drucker

Note that the first link in this list is the first link (outside parentheses) in the first paragraph: the link you wanted.
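
If you want to try the filter on its own, without a network request, here is a minimal sketch that applies the same regular expression to a plain list of href values. The helper name first_non_help_href and the sample list are made up for illustration; they are not taken from the Wikipedia page.

```python
import re

# Same pattern as in the Code section: hrefs that start with /wiki/Help:
help_link_regex = re.compile(r'^/wiki/Help:')


def first_non_help_href(hrefs):
    """Return the first href that is present and is not a Help: link."""
    for href in hrefs:
        # An <a> tag without an href gives None from link.get('href'); skip it.
        if href and not help_link_regex.match(href):
            return href
    return None


# Hypothetical hrefs, in the order they might appear in a paragraph.
sample_hrefs = [
    None,
    '/wiki/Help:IPA/English',
    '/wiki/Qualitative_property',
    '/wiki/Information',
]
print(first_non_help_href(sample_hrefs))
```

The helper returns None when every href is missing or is a Help: link, so the caller can decide what to do with a paragraph that has no usable link.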

The code in the Code section adds only the first non-help link of each paragraph to list_a. If you want to get them all, just remove the break:

Output (after removing the `break`)

/wiki/Qualitative_property
/wiki/Quantitative_data
/wiki/Variable_(research)
/wiki/Information
/wiki/Scientific_research
/wiki/Stock_price
/wiki/Crime_rate
/wiki/Unemployment_rate
/wiki/Literacy
/wiki/Homelessness
/wiki/Measurement
/wiki/Data_reporting
/wiki/Data_analysis
/wiki/Data_visualization
/wiki/Concept
/wiki/Information
/wiki/Knowledge
/wiki/Data_processing
/wiki/Number
/wiki/Character_(computing)
/wiki/Outlier
/wiki/Field_work
/wiki/In_situ
/wiki/Experimental_data
/wiki/Petroleum
/wiki/Digital_economy
/wiki/Data_(word)
/wiki/Mass_noun
/wiki/Information
/wiki/Knowledge
/wiki/Wisdom
/wiki/Shannon_entropy
/wiki/Knowledge
/wiki/Mount_Everest
/wiki/Altimeter
/wiki/Sign
/wiki/Marketing
/wiki/Social_services
/wiki/Truth
/wiki/Analog_computer
/wiki/Computer
/wiki/Alphabet
/wiki/Computer_program
/wiki/Lisp_(programming_language)
/wiki/Metadata
/wiki/Johanna_Drucker
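
The /wiki/Help: prefix filter relies on every parenthesized link being a help link. If a page had an ordinary /wiki/ link inside parentheses, the filter would not catch it. One possible alternative, which is not what the code above does, is to count opening and closing parentheses in the paragraph text and accept a link only while the count is zero. The rough sketch below does this with the standard library's html.parser instead of BeautifulSoup, so it runs offline. The class name, the function name and the sample paragraph are all hypothetical, and it assumes the parentheses are balanced within the paragraph.

```python
from html.parser import HTMLParser


class FirstLinkOutsideParens(HTMLParser):
    """Record the first <a href> seen while parenthesis depth is zero."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of "(" currently open in the text
        self.first_href = None  # first href found outside parentheses

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.first_href is None and self.depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.first_href = href

    def handle_data(self, data):
        # Only text between tags reaches this method, so parentheses in
        # attribute values such as href="/wiki/Data_(word)" are not counted.
        for ch in data:
            if ch == "(":
                self.depth += 1
            elif ch == ")" and self.depth > 0:
                self.depth -= 1


def first_link_outside_parens(paragraph_html):
    parser = FirstLinkOutsideParens()
    parser.feed(paragraph_html)
    parser.close()
    return parser.first_href


# Hand-written paragraph: a help link and an ordinary link inside the
# parentheses, followed by the link we actually want.
sample = (
    '<p><b>Data</b> (<a href="/wiki/Help:IPA/English">pronunciation</a>; '
    'see <a href="/wiki/Example_inside">an ordinary link</a>) are '
    '<a href="/wiki/Qualitative_property">qualitative</a> values.</p>'
)
print(first_link_outside_parens(sample))
```

With real pages you would pass str(p) for each paragraph p found by BeautifulSoup. Footnote links and other special links would still need their own filtering.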

I hope this helps. If it doesn't, let me know what went wrong.

Answered By - JoshuaCS

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
