Issue
I want to extract every href and src attribute inside all the divs on the page that have class 'news_item'.
The HTML looks like this:
<div class="col">
  <div class="group">
    <h4>News</h4>
    <div class="news_item">
      <a href="www.link.com">
        <h2 class="link">
          here is a link-heading
        </h2>
        <div class="Img">
          <img border="0" src="/image/link" />
        </div>
        <p></p>
      </a>
    </div>
From this, what I want to extract is:
www.link.com, "here is a link-heading" and /image/link
My code is:
def scrape_a(url):
    news_links = soup.select("div.news_item [href]")
    for links in news_links:
        if news_links:
            return 'http://www.web.com' + news_links['href']

def scrape_headings(url):
    for news_headings in soup.select("h2.link"):
        return str(news_headings.string.strip())

def scrape_images(url):
    images = soup.select("div.Img[src]")
    for image in images:
        if images:
            return 'http://www.web.com' + news_links['src']

def top_stories():
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = scrape_a(soup)
    heading = scrape_headings(soup)
    image = scrape_images(soup)
    message = {'heading': heading, 'link': link, 'image': image}
    print message
The problem is that it gives me this error:
**TypeError: 'NoneType' object is not callable**
Here is the Traceback:
Traceback (most recent call last):
  File "web_parser.py", line 40, in <module>
    top_stories()
  File "web_parser.py", line 32, in top_stories
    link = scrape_a('www.link.com')
  File "web_parser.py", line 10, in scrape_a
    news_links = soup.select_all("div.news_item [href]")
Solution
You should grab all of the news items at once and then iterate through them. That makes it easy to organize the data you get into manageable chunks (in this case, one dict per news item). Try something like this:
import requests
from bs4 import BeautifulSoup

url = "http://www.web.com"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly

messages = []
news_items = soup.select("div.news_item")  # every div with class news_item
for item in news_items:
    message = {}

    # skip any item that is missing its heading, link or image
    heading = item.find("h2")
    if heading is None:
        continue
    message['heading'] = heading.text.strip()

    link = item.find("a")
    if link is None:
        continue
    message['link'] = link['href']

    image = item.find("img")
    if image is None:
        continue
    message['image'] = "http://www.web.com{}".format(image['src'])

    messages.append(message)

print(messages)
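The string formatting used for the image URL only works when every src is site-relative and starts with a slash. As a hedged alternative, the standard library's urljoin can resolve both relative and absolute values against the base URL. This is a minimal sketch, not part of the original answer: the helper name absolutize is hypothetical, and the base URL is the placeholder used above.

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

BASE_URL = "http://www.web.com"  # placeholder base URL from the answer above


def absolutize(value, base=BASE_URL):
    """Resolve an href or src value against the base URL (hypothetical helper)."""
    return urljoin(base, value)


# A site-relative path is attached to the base.
print(absolutize("/image/link"))  # http://www.web.com/image/link

# A value that already has a scheme is returned unchanged.
print(absolutize("http://other.example/pic.png"))

# A value without a scheme, like the href in the question, is treated
# as a relative path rather than as a separate host.
print(absolutize("www.link.com"))

In the loop above you could then write message['link'] = absolutize(link['href']) and message['image'] = absolutize(image['src']). Note that an href such as www.link.com has no scheme, so it resolves as a path under the base URL; if it is meant to be an external site, the page's markup would need http:// in front of it.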
Answered By - wpercy