Issue
I have scraped a page and gathered the URLs that are on the site. I was then trying to pass those found URLs back into BeautifulSoup to open them and scrape them for their content. This might be something super simple I am missing, and I apologize for that; I've only started Python this last week. Here is my code.
from bs4 import BeautifulSoup
import requests
# gets website
html_text = requests.get('https://www.marketwatch.com/latest-news?mod=top_nav').text
# uses BeautifulSoup to read the html
website = BeautifulSoup(html_text, 'lxml')
# searches the html that BS read into py
stories = website.find('div', class_='collection__elements j-scrollElement')
# searches stories for a single story
story = stories.find('div', class_='element element--article')
# searches story for the headline
article_headline = story.find('h3', class_='article__headline').text
# get link to article from the headline to then pass through to get the article's content
possible_links = story.find_all('a', class_='link')
for link in possible_links:
    linkArt = link.get('href')
    print('Link:' + linkArt)  # make sure expected link was found
    # Pass found link back into a request (part that doesn't work)
    article_html = requests.get(linkArt).text
    article = BeautifulSoup(article_html, ' lxml')
    article_wrapper = article.find('div', class_='column column--full article__content')
    article_content = article_wrapper.find_all('p').text
This is my error:

Link:https://www.marketwatch.com/articles/global-stocks-edge-lower-with-biden-adopting-traditional-dollar-stance-chinese-equities-gain-after-gdp-report-51610967473?mod=newsviewer_click
Traceback (most recent call last):
  File "C:\Users\MLG420\PycharmProjects\scraper\scrape.py", line 23, in <module>
    article = BeautifulSoup(article_html, ' lxml')
  File "C:\Users\MLG420\PycharmProjects\scraper\venv\lib\site-packages\bs4\__init__.py", line 243, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Solution
There were two errors: there is a stray space in ' lxml' that breaks the BeautifulSoup call that creates article, and in the last line you are calling .text on a list (find_all returns a ResultSet of tags, so you have to take .text from each element individually).
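To see that second error in isolation, here is a minimal, self-contained sketch (the tiny HTML snippet is made up for illustration, and it uses the built-in html.parser so it runs even without lxml installed): find_all gives back a list-like ResultSet, which has no .text of its own.

```python
from bs4 import BeautifulSoup

# a tiny hypothetical page, just to illustrate find_all's return value
soup = BeautifulSoup('<div><p>one</p><p>two</p></div>', 'html.parser')

paras = soup.find_all('p')       # a list-like ResultSet, not a single tag
texts = [p.text for p in paras]  # .text works per element, not on the list
print(texts)                     # ['one', 'two']
```

Calling paras.text instead would raise an AttributeError, which is the second bug in the code above.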
Here's the corrected code:
from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; importing it fails fast if the lxml parser is missing

# gets website
html_text = requests.get('https://www.marketwatch.com/latest-news?mod=top_nav').text
# uses BeautifulSoup to read the html
website = BeautifulSoup(html_text, 'lxml')
# searches the html that BS read into py
stories = website.find('div', class_='collection__elements j-scrollElement')
# searches stories for a single story
story = stories.find('div', class_='element element--article')
# searches story for the headline
article_headline = story.find('h3', class_='article__headline').text
# get link to article from the headline to then pass through to get the article's content
possible_links = story.find_all('a', class_='link')
for link in possible_links:
    linkArt = link.get('href')
    print('Link:' + linkArt)  # make sure expected link was found
    # pass found link back into a request
    article_html = requests.get(linkArt).text
    article = BeautifulSoup(article_html, 'lxml')  # no stray space before 'lxml'
    article_wrapper = article.find('div', class_='column column--full article__content')
    if article_wrapper is None:  # skip links that don't point at an article page
        continue
    article_content = article_wrapper.find_all('p')  # a list of tags, not a single tag
    for elem in article_content:
        print(elem.text)  # or do something else
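One more thing worth knowing for this kind of follow-the-links scraping: href values pulled from a page can be relative (e.g. /story/...), and requests.get needs an absolute URL. A small stdlib-only sketch (the href values below are made up for illustration) resolves them against the page URL with urllib.parse.urljoin before requesting:

```python
from urllib.parse import urljoin

page_url = 'https://www.marketwatch.com/latest-news?mod=top_nav'

# hypothetical href values, as they might come out of link.get('href')
hrefs = ['/story/example-article-123',
         'https://www.marketwatch.com/articles/another-article']

# urljoin resolves relative paths against the page and leaves absolute URLs alone
full_urls = [urljoin(page_url, h) for h in hrefs]
print(full_urls)
```

Passing each resolved URL to requests.get avoids a MissingSchema error on relative links.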
Answered By - frab