Issue
I have scraped a page and gathered the URLs that are on the site. I was then trying to pass those found URLs back into BeautifulSoup to open them and scrape them for their content. This might be something super simple I am missing, and I apologize for that; I've only started Python this last week. Here is my code.
from bs4 import BeautifulSoup
import requests
# gets website
html_text = requests.get('https://www.marketwatch.com/latest-news?mod=top_nav').text
# uses BeautifulSoup to read the html
website = BeautifulSoup(html_text, 'lxml')
# searches the html that BS read into py
stories = website.find('div', class_='collection__elements j-scrollElement')
# searches stories for a single story
story = stories.find('div', class_='element element--article')
# searches story for the headline
article_headline = story.find('h3', class_='article__headline').text
# get link to article from the headline to then pass through to get the article's content
possible_links = story.find_all('a', class_='link')
for link in possible_links:
    linkArt = link.get('href')
    print('Link:' + linkArt)  # make sure expected link was found
    # Pass found link back into a request (part that doesn't work)
    article_html = requests.get(linkArt).text
    article = BeautifulSoup(article_html, ' lxml')
    article_wrapper = article.find('div', class_='column column--full article__content')
    article_content = article_wrapper.find_all('p').text
This is my error:

Link:https://www.marketwatch.com/articles/global-stocks-edge-lower-with-biden-adopting-traditional-dollar-stance-chinese-equities-gain-after-gdp-report-51610967473?mod=newsviewer_click
Traceback (most recent call last):
  File "C:\Users\MLG420\PycharmProjects\scraper\scrape.py", line 23, in <module>
    article = BeautifulSoup(article_html, ' lxml')
  File "C:\Users\MLG420\PycharmProjects\scraper\venv\lib\site-packages\bs4\__init__.py", line 243, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Solution
There were two errors: there is a stray space in ' lxml' that breaks the BeautifulSoup call that creates article, and in the last line you are calling .text on a list (find_all returns a ResultSet of tags, so you have to take .text from each element individually).
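To see that second error in isolation, here is a minimal, self-contained sketch (the tiny HTML snippet is made up for illustration, and it uses the built-in html.parser so it runs even without lxml installed): find_all gives back a list-like ResultSet, which has no .text of its own.

```python
from bs4 import BeautifulSoup

# a tiny hypothetical page, just to illustrate find_all's return value
soup = BeautifulSoup('<div><p>one</p><p>two</p></div>', 'html.parser')

paras = soup.find_all('p')       # a list-like ResultSet, not a single tag
texts = [p.text for p in paras]  # .text works per element, not on the list
print(texts)                     # ['one', 'two']
```

Calling paras.text instead would raise an AttributeError, which is the second bug in the code above.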
Here's the corrected code:
from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; importing it fails fast if the lxml parser is missing

# gets website
html_text = requests.get('https://www.marketwatch.com/latest-news?mod=top_nav').text
# uses BeautifulSoup to read the html
website = BeautifulSoup(html_text, 'lxml')
# searches the html that BS read into py
stories = website.find('div', class_='collection__elements j-scrollElement')
# searches stories for a single story
story = stories.find('div', class_='element element--article')
# searches story for the headline
article_headline = story.find('h3', class_='article__headline').text
# get link to article from the headline to then pass through to get the article's content
possible_links = story.find_all('a', class_='link')
for link in possible_links:
    linkArt = link.get('href')
    print('Link:' + linkArt)  # make sure expected link was found
    # pass found link back into a request
    article_html = requests.get(linkArt).text
    article = BeautifulSoup(article_html, 'lxml')  # no stray space before 'lxml'
    article_wrapper = article.find('div', class_='column column--full article__content')
    if article_wrapper is None:  # skip links that don't point at an article page
        continue
    article_content = article_wrapper.find_all('p')  # a list of tags, not a single tag
    for elem in article_content:
        print(elem.text)  # or do something else
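One more thing worth knowing for this kind of follow-the-links scraping: href values pulled from a page can be relative (e.g. /story/...), and requests.get needs an absolute URL. A small stdlib-only sketch (the href values below are made up for illustration) resolves them against the page URL with urllib.parse.urljoin before requesting:

```python
from urllib.parse import urljoin

page_url = 'https://www.marketwatch.com/latest-news?mod=top_nav'

# hypothetical href values, as they might come out of link.get('href')
hrefs = ['/story/example-article-123',
         'https://www.marketwatch.com/articles/another-article']

# urljoin resolves relative paths against the page and leaves absolute URLs alone
full_urls = [urljoin(page_url, h) for h in hrefs]
print(full_urls)
```

Passing each resolved URL to requests.get avoids a MissingSchema error on relative links.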
Answered By - frab