Issue
I've been trying to work with BeautifulSoup because I want to try and scrape a webpage (https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1). So far I scraped some elements with success but now I wanted to scrape a movie description but I've been struggling. The description is simply situated like this in html :
<div class="lister-item mode-advanced">
<div class="lister-item-content>
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>
I want to scrape the second paragraph which seemed easy to do but everything I tried gave me 'None' as output. I've been digging around to find an answer. In an other stackoverflow post I found that
find('p:nth-of-type(1)')
or
find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')
could do the trick but it still gives me
none #as output
Below you can find a piece of my code it's a bit low grade code because I'm just trying out stuff to learn
import urllib2
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?
release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-
advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description
the above code gives me this output:
$ python scrape.py
Logan
(2017)
8.1
None
I would like to learn the correct method of selecting html tags because it will be useful to know for future projects.
Solution
find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
You can then use the list's index to get the element you need. Index starts at 0, so 1 will give the second item.
Change the first_description to this.
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
Full code
import urllib2
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
#first_description = first_description.text
print first_description
Output
Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.
Read the Documentation to learn the correct method of selecting html tags.
Also consider moving to python 3.
Answered By - Bitto Bennichan
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.