Saturday, May 7, 2022

[FIXED] How to extract all the hrefs and src inside specific divs with beautifulsoup python

Tags: beautifulsoup, parsing, python

Issue

I want to extract every href and src attribute inside all the divs on the page that have class="news_item".

The HTML looks like this:

<div class="col">
  <div class="group">
    <h4>News</h4>
    <div class="news_item">
      <a href="www.link.com">
        <h2 class="link">
          here is a link-heading
        </h2>
        <div class="Img">
          <img border="0" src="/image/link" />
        </div>
        <p></p>
      </a>
    </div>
  </div>
</div>

From this, what I want to extract is:

www.link.com, "here is a link-heading" and /image/link

My code is:

    def scrape_a(url):
        news_links = soup.select("div.news_item [href]")
        for links in news_links:
            if news_links:
                return 'http://www.web.com' + news_links['href']

    def scrape_headings(url):
        for news_headings in soup.select("h2.link"):
            return str(news_headings.string.strip())

    def scrape_images(url):
        images = soup.select("div.Img[src]")
        for image in images:
            if images:
                return 'http://www.web.com' + news_links['src']

    def top_stories():
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        link = scrape_a(soup)
        heading = scrape_headings(soup)
        image = scrape_images(soup)
        message = {'heading': heading, 'link': link, 'image': image}
        print message

The problem is that it raises this error:

    TypeError: 'NoneType' object is not callable

Here is the traceback:

Traceback (most recent call last):
  File "web_parser.py", line 40, in <module>
    top_stories()
  File "web_parser.py", line 32, in top_stories
    link = scrape_a('www.link.com')
  File "web_parser.py", line 10, in scrape_a
    news_links = soup.select_all("div.news_item [href]")

Solution
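
A note on the error itself: the traceback shows a call to soup.select_all, while the posted code calls soup.select. Beautiful Soup has no select_all method, and it treats an unknown attribute name as a lookup for a tag with that name, so the attribute evaluates to None and calling it fails. The sketch below reproduces this on a small inline HTML string, which is an assumption made for illustration rather than the asker's real page.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the real page (assumption for illustration).
html = '<div class="news_item"><a href="www.link.com">link</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute names are treated as tag lookups, so this searches
# for a <select_all> tag and finds nothing.
missing = soup.select_all
print(missing)  # None

error_message = ""
try:
    soup.select_all("div.news_item [href]")  # calling None
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# select() is the CSS-selector method; it returns a list of matching tags.
matches = soup.select("div.news_item [href]")
hrefs = [tag["href"] for tag in matches]
print(hrefs)
```

Note that select() returns a list, so the attribute has to be read from each matched tag (tag["href"]) rather than from the list.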

You should grab all of the news items at once and then iterate through them. This makes it easy to organize the data into manageable chunks (in this case, dicts). Try something like this:

import requests
from bs4 import BeautifulSoup

url = "http://www.web.com"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly

messages = []

news_items = soup.select("div.news_item")  # every div with class "news_item"
for item in news_items:
    message = {}

    heading = item.find("h2")
    link = item.find("a")
    image = item.find("img")
    if not (heading and link and image):
        continue  # skip items that are missing any of the three parts

    message['heading'] = heading.text.strip()
    message['link'] = link['href']
    message['image'] = "http://www.web.com{}".format(image['src'])

    messages.append(message)

print(messages)
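
The same per-item approach can be tried without a network request. The following is a minimal sketch, assuming the HTML from the question is available as a string. The names SAMPLE_HTML, BASE_URL and extract_news are hypothetical, and urljoin from the standard library is swapped in for the string formatting to build the absolute image URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.web.com"  # assumed site root, as used in the question

# Sample markup modeled on the question, plus one item with nothing to extract.
SAMPLE_HTML = """
<div class="news_item">
  <a href="www.link.com">
    <h2 class="link">here is a link-heading</h2>
    <div class="Img"><img border="0" src="/image/link" /></div>
  </a>
</div>
<div class="news_item"><p>no link or image here</p></div>
"""


def extract_news(html, base_url=BASE_URL):
    """Return one dict per news_item div that has a heading, link and image."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for item in soup.select("div.news_item"):
        heading = item.find("h2")
        link = item.find("a", href=True)    # only <a> tags that carry an href
        image = item.find("img", src=True)  # only <img> tags that carry a src
        if not (heading and link and image):
            continue  # incomplete item, skip it
        items.append({
            "heading": heading.get_text(strip=True),
            "link": link["href"],  # left exactly as written in the page
            "image": urljoin(base_url, image["src"]),
        })
    return items


news = extract_news(SAMPLE_HTML)
print(news)
```

The href is kept as written because "www.link.com" has no scheme, so joining it to the base URL would turn it into a path on www.web.com, which may not be what is wanted.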

Answered By - wpercy

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
