Issue
I am trying my hand at web scraping using BeautifulSoup.
I posted about this before (How do I extract only the content from this webpage), but I was not very clear about what I wanted, so the answer only partially addresses my issue. I want to extract the content from the webpage and then extract all the links from that output. Can someone please help me understand where I am going wrong?
This is what I have after updating my previous code with the answer provided in the link above.
import urllib3
from bs4 import BeautifulSoup

# Define the content to retrieve (webpage's URL)
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

# Retrieve the page
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print(f'Type of Variable "page": {page.__class__.__name__}')
    print(f'Page Retrieved. Request Status: {r.status}, Page Size: {len(page)}')
else:
    print(f'Some problem occurred. Request status: {r.status}')

# Convert the stream of bytes into a BeautifulSoup representation
soup = BeautifulSoup(page, 'html.parser')
print(f'Type of variable "soup": {soup.__class__.__name__}')

# Check the content
print(f'{soup.prettify()[:1000]}')

# Check the HTML's title
print(f'Title tag: {soup.title}')
print(f'Title text: {soup.title.string}')

# Find the main content (all paragraph tags)
article_tag = 'p'
articles = soup.find_all(article_tag)
print(f'Type of the variable "articles": {articles.__class__.__name__}')
for p in articles:
    print(p.text)
I then used the code below to get all the links, but I get an error:
# Find the links in the text
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in articles.find_all(tag)]
tag_list
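The original post does not include the traceback, but with bs4 this line typically fails with an AttributeError along these lines, because articles is a list-like ResultSet rather than a single tag:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?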
Solution
That is because `articles` is a `ResultSet`, the return type of `soup.find_all(article_tag)`, which you can check with `type(articles)`. A `ResultSet` is a list of tags and has no `find_all()` method itself. To reach your goal, you have to iterate over `articles` first, so simply add an additional for-loop to your list comprehension:
[t.get('href') for article in articles for t in article.find_all(tag)]
In addition, you may want to use a set to avoid duplicates, and also concatenate relative paths with the base URL:
list(set(
    t.get('href') if t.get('href').startswith('http') else 'https://bigbangtheory.fandom.com' + t.get('href')
    for article in articles
    for t in article.find_all(tag)
    if t.get('href')  # skip <a> tags that have no href attribute
))
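A slightly more robust variant (a sketch, not part of the original answer) uses urllib.parse.urljoin from the standard library, which resolves relative paths, fragments, and protocol-relative URLs; base_url here is assumed from the page being scraped:

from urllib.parse import urljoin

base_url = 'https://bigbangtheory.fandom.com'
# Set comprehension removes duplicates; urljoin resolves relative hrefs
links = list({
    urljoin(base_url, t.get('href'))
    for article in articles
    for t in article.find_all('a')
    if t.get('href')  # skip <a> tags without an href attribute
})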
Output:
['https://bigbangtheory.fandom.com/wiki/The_Killer_Robot_Instability',
'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
'https://bigbangtheory.fandom.com/wiki/The_Valentino_Submergence',
'https://bigbangtheory.fandom.com/wiki/The_Beta_Test_Initiation',
'https://bigbangtheory.fandom.com/wiki/Season_2',
'https://bigbangtheory.fandom.com/wiki/Dr._Pemberton',...]
Answered By - HedgeHog