Issue
I am trying my hand at web scraping using BeautifulSoup.
I posted about this before (How do I extract only the content from this webpage), but I was not very clear about what I wanted, so the answer only partially addresses my issue. I want to extract the content from the webpage and then extract all the links from that output. Can someone please help me understand where I am going wrong?
This is what I have after updating my previous code with the answer provided in the link above.
import urllib3
from bs4 import BeautifulSoup

# Define the content to retrieve (webpage's URL)
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

# Retrieve the page
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print(f'Type of Variable "page": {page.__class__.__name__}')
    print(f'Page Retrieved. Request Status: {r.status}, Page Size: {len(page)}')
else:
    print(f'Some problem occurred. Request status: {r.status}')

# Convert the stream of bytes into a BeautifulSoup representation
soup = BeautifulSoup(page, 'html.parser')
print(f'Type of variable "soup": {soup.__class__.__name__}')

# Check the content
print(f'{soup.prettify()[:1000]}')

# Check the HTML's title
print(f'Title tag: {soup.title}')
print(f'Title text: {soup.title.string}')

# Find the main content (all paragraph tags)
article_tag = 'p'
articles = soup.find_all(article_tag)
print(f'Type of the variable "articles": {articles.__class__.__name__}')
for p in articles:
    print(p.text)
I then used the code below to get all the links, but I get an error:
# Find the links in the text
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in articles.find_all(tag)]
tag_list
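The original post does not include the traceback, but with bs4 this line typically fails with an AttributeError along these lines, because articles is a list-like ResultSet rather than a single tag:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?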
Solution
That is because `articles` is a `ResultSet`, the return type of `soup.find_all(article_tag)`, which you can check with `type(articles)`. A `ResultSet` is a list of tags and has no `find_all()` method itself. To reach your goal, you have to iterate over `articles` first, so simply add an additional for-loop to your list comprehension:
[t.get('href') for article in articles for t in article.find_all(tag)]
In addition, you may want to use a set to avoid duplicates, and also concatenate relative paths with the base URL:
list(set(
    t.get('href') if t.get('href').startswith('http') else 'https://bigbangtheory.fandom.com' + t.get('href')
    for article in articles
    for t in article.find_all(tag)
    if t.get('href')  # skip <a> tags that have no href attribute
))
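A slightly more robust variant (a sketch, not part of the original answer) uses urllib.parse.urljoin from the standard library, which resolves relative paths, fragments, and protocol-relative URLs; base_url here is assumed from the page being scraped:

from urllib.parse import urljoin

base_url = 'https://bigbangtheory.fandom.com'
# Set comprehension removes duplicates; urljoin resolves relative hrefs
links = list({
    urljoin(base_url, t.get('href'))
    for article in articles
    for t in article.find_all('a')
    if t.get('href')  # skip <a> tags without an href attribute
})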
Output:
['https://bigbangtheory.fandom.com/wiki/The_Killer_Robot_Instability',
'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
'https://bigbangtheory.fandom.com/wiki/The_Valentino_Submergence',
'https://bigbangtheory.fandom.com/wiki/The_Beta_Test_Initiation',
'https://bigbangtheory.fandom.com/wiki/Season_2',
'https://bigbangtheory.fandom.com/wiki/Dr._Pemberton',...]
Answered By - HedgeHog