Saturday, October 22, 2022

[FIXED] How do I extract only the content from this webpage

Labels: beautifulsoup, python, web-scraping

Issue

I am trying out web scraping using BeautifulSoup.

I only want to extract the content from this webpage, basically everything about Barry Kripke without all the headers, etc.: https://bigbangtheory.fandom.com/wiki/Barry_Kripke

I tried this, but it doesn't give me what I want:

import urllib3
from bs4 import BeautifulSoup

quote = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
http = urllib3.PoolManager()
r = http.request('GET', quote)

if r.status == 200:
  page = r.data
  print('Type of the variable \'page\':', page.__class__.__name__)
  print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
  print('Some problem occurred. Request Status: %s' % r.status)

soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

print(soup.prettify()[:1000])

article_tag = 'p'
article = soup.find_all(article_tag)[0]
print(f'Type of the variable "article":{article.__class__.__name__}')

print(article.text)

The output I get is below, which is just the first paragraph

What I want is this:

Next I tried to get all the links, but that didn't work either - I got only 2 links:

for t in article.find_all('a'):
    print(t)

Can someone please help me with this?

Solution

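You only grab and print the first <p> tag with article = soup.find_all(article_tag)[0]. find_all returns a list-like ResultSet of every matching tag, and [0] selects just the first one. For the same reason, article.find_all('a') only finds the links inside that first paragraph.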
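To make the difference concrete, here is a minimal sketch that runs offline. SAMPLE_HTML is a made-up stand-in for the downloaded page (it is not the real wiki markup), and the variable names are only illustrative. It shows that searching from the first paragraph sees only that paragraph's links, while searching from soup sees every link in the document.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <p>First paragraph with <a href="/wiki/A">one link</a>.</p>
  <p>Second paragraph with <a href="/wiki/B">another</a> and <a href="/wiki/C">a third</a>.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A single Tag: only the first <p> on the page.
first_paragraph = soup.find_all("p")[0]

# Searching from a Tag only looks inside that tag.
links_in_first = first_paragraph.find_all("a")

# Searching from the soup object looks through the whole document.
links_in_page = soup.find_all("a")

print(len(links_in_first))                  # 1
print(len(links_in_page))                   # 3
print([a["href"] for a in links_in_page])   # ['/wiki/A', '/wiki/B', '/wiki/C']
```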

You need to go through all the <p> tags:

import requests
from bs4 import BeautifulSoup

url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
r = requests.get(url)

# Stop early on a failed request so that 'page' is never left undefined.
if r.status_code != 200:
    raise SystemExit('Some problem occurred. Request Status: %s' % r.status_code)

page = r.text
print('Type of the variable \'page\':', page.__class__.__name__)
print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status_code, len(page)))

soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

# Preview the first 1000 characters of the parsed HTML.
print(soup.prettify()[:1000])

# find_all returns every <p> tag on the page, not just the first one.
article_tag = 'p'
articles = soup.find_all(article_tag)

# Print the text of each paragraph in document order.
for p in articles:
    print(p.text)
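Looping over every <p> on the page can still pick up paragraphs from navigation or footer areas. One possible refinement, sketched below, is to search only inside the element that wraps the article body. The selector div.mw-parser-output is an assumption based on the class MediaWiki-based sites commonly use for rendered content; check the page source before relying on it. The helper extract_paragraphs and SAMPLE_HTML are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html, container_selector="div.mw-parser-output"):
    """Return the text of each non-empty <p> inside the content container."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the article body lives inside this container.
    # If it is not found, fall back to searching the whole document.
    container = soup.select_one(container_selector)
    if container is None:
        container = soup

    paragraphs = []
    for p in container.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip empty placeholder paragraphs
            paragraphs.append(text)
    return paragraphs


# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <header><p>Site navigation</p></header>
  <div class="mw-parser-output">
    <p>Intro paragraph about the <b>character</b>.</p>
    <p></p>
    <p>Second paragraph.</p>
  </div>
  <footer><p>Footer text</p></footer>
</body></html>
"""

print("\n\n".join(extract_paragraphs(SAMPLE_HTML)))
```

With the real page, the same function could be called as extract_paragraphs(r.text), where r is the response from requests.get(url) in the answer above.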

Answered By - chitown88

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
