Issue
I am trying out web scraping using BeautifulSoup.
I only want to extract the content from this webpage, basically everything about Barry Kripke without the headers, etc.: https://bigbangtheory.fandom.com/wiki/Barry_Kripke
I tried this, but it doesn't give me what I want:
import urllib3
from bs4 import BeautifulSoup

quote = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
http = urllib3.PoolManager()
r = http.request('GET', quote)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)
print(soup.prettify()[:1000])

article_tag = 'p'
article = soup.find_all(article_tag)[0]
print(f'Type of the variable "article":{article.__class__.__name__}')
print(article.text)
The output I get is just the first paragraph.
Next I tried to get all the links, but that didn't work either - I got only 2 links:
for t in article.find_all('a'):
    print(t)
Please can someone help me with this.
Solution
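You only grab and print the first <p> tag with article = soup.find_all(article_tag)[0]. That is also why you only got 2 links: article.find_all('a') searches only inside that first paragraph.
You need to go through all the <p> tags: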
import requests
from bs4 import BeautifulSoup

url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
r = requests.get(url)
if r.status_code == 200:
    page = r.text
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status_code, len(page)))
else:
    # Stop here: without a page there is nothing to parse
    raise SystemExit('Some problem occurred. Request Status: %s' % r.status_code)

soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)
print(soup.prettify()[:1000])

article_tag = 'p'
# find_all returns every <p> tag, not just the first one
articles = soup.find_all(article_tag)
for p in articles:
    print(p.text)
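Since the question also asks about the links, here is a minimal sketch of the same loop that collects the links from every paragraph as well as the text. The function name extract_text_and_links and its container_selector parameter are hypothetical names made up for this example, not part of the original answer. The selector is optional and only narrows the search to one part of the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_text_and_links(html, base_url, container_selector=None):
    """Return (paragraph_texts, link_urls) for every <p> in the page.

    container_selector is an optional CSS selector (hypothetical parameter)
    that limits the search to one element, e.g. the article body.
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Fall back to the whole document when no container is given or found
    root = soup.select_one(container_selector) if container_selector else None
    if root is None:
        root = soup

    texts, links = [], []
    for p in root.find_all('p'):
        text = p.get_text().strip()
        if text:  # skip empty paragraphs
            texts.append(text)
        # Look for links inside each paragraph, not only the first one
        for a in p.find_all('a', href=True):
            # Turn relative hrefs such as /wiki/... into absolute URLs
            links.append(urljoin(base_url, a['href']))
    return texts, links
```

With the page and url variables from the code above, it could be called as texts, links = extract_text_and_links(page, url, 'div.mw-parser-output'). The selector div.mw-parser-output is an assumption about how the wiki wraps the article body; check it in your browser's inspector, or leave the selector out to scan the whole page.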
Answered By - chitown88