Friday, November 11, 2022

[FIXED] How to scrape second <p> of webpage using python and Beautifulsoup

Tags: beautifulsoup, html, python

Issue

I've been trying out BeautifulSoup to scrape a webpage (https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1). So far I have scraped some elements successfully, but I am struggling to scrape the movie description. The description sits in the HTML like this:

<div class="lister-item mode-advanced"> 
    <div class="lister-item-content">
       <p class="muted-text"> paragraph I don't need</p>
       <p class="muted-text"> paragraph I need</p>
    </div>
</div>

I want to scrape the second paragraph, which seemed easy, but everything I tried gave me None as output. I've been digging around for an answer. In another Stack Overflow post I found that

find('p:nth-of-type(1)')

find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')

could do the trick, but it still gives me

None  # as output

Below is a piece of my code. It's a bit rough because I'm just trying things out to learn.

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description

The above code gives me this output:

$ python scrape.py
Logan
(2017)
8.1
None

I would like to learn the correct way to select HTML tags because it will be useful for future projects.

Solution

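The class on those paragraphs is text-muted, not muted-text, so find('p', class_='muted-text') matches nothing and returns None. Even with the right class, find() returns only the first match.

The find_all() method looks through a tag's descendants and returns a list of all descendants that match your filters.

You can then index into that list to get the element you need. Indexing starts at 0, so index 1 gives the second item.

Change first_description to this: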

first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
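To see the difference between the two calls without hitting the network, here is a minimal sketch on a hand-written fragment. The HTML is a simplified stand-in for one IMDb list item (an assumption, not the live page), and the snippet uses Python 3 syntax.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one IMDb list item (assumed markup, not the live page).
html = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <p class="text-muted">paragraph I don't need</p>
    <p class="text-muted">paragraph I need</p>
  </div>
</div>
"""

item = BeautifulSoup(html, "html.parser").find("div", class_="lister-item")

# No <p> carries the class "muted-text", so find() has nothing to return.
wrong = item.find("p", class_="muted-text")

# find() stops at the first match; find_all() returns every match in document order.
first = item.find("p", class_="text-muted")
paragraphs = item.find_all("p", class_="text-muted")
second = paragraphs[1].text.strip()

print(wrong)
print(first.text.strip())
print(second)
```

The list returned by find_all() behaves like any Python list, so len(paragraphs) can be checked before indexing if the page might contain fewer matches.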

Full code

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# Fixed: take the second matching <p> and strip surrounding whitespace
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
print first_description

Output

Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.

Read the Beautiful Soup documentation to learn the correct way to select HTML tags.
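The selectors tried in the question are CSS selectors, which find() does not interpret: it treats its first argument as a tag name. Beautiful Soup handles CSS selectors through select() and select_one() instead, and find_elements_by_css_selector is a Selenium method rather than a Beautiful Soup one. A small sketch, assuming Beautiful Soup 4.7 or later and the same simplified, hand-written markup (not the live page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one IMDb list item (assumed markup, not the live page).
html = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <p class="text-muted">paragraph I don't need</p>
    <p class="text-muted">paragraph I need</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() searches for a tag literally named "p:nth-of-type(2)", which does not exist.
as_tag_name = soup.find("p:nth-of-type(2)")

# select_one() parses the string as a CSS selector; nth-of-type counts from 1.
second = soup.select_one("div.lister-item-content > p:nth-of-type(2)")

# select() returns a list, so ordinary zero-based indexing works here too.
also_second = soup.select("div.lister-item-content > p.text-muted")[1]

print(as_tag_name)
print(second.get_text(strip=True))
print(also_second.get_text(strip=True))
```

Note the off-by-one difference between the two styles: CSS nth-of-type(2) and Python index [1] both refer to the second paragraph.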

Also consider moving to Python 3.
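As a rough illustration of what a Python 3 version of the parsing step might look like, the sketch below wraps the lookup in a function and guards the index. The function name second_muted_paragraph and the sample markup are hypothetical, and the class names are taken from the code above; the live page may no longer use them.

```python
from bs4 import BeautifulSoup


def second_muted_paragraph(html):
    """Return the text of the second <p class="text-muted"> in the first list item, or None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="lister-item mode-advanced")
    if container is None:
        return None
    paragraphs = container.find_all("p", class_="text-muted")
    # Guard the index: paragraphs[1] raises IndexError when fewer than two match.
    if len(paragraphs) < 2:
        return None
    return paragraphs[1].get_text(strip=True)


# Hand-written sample standing in for the downloaded page (assumed markup).
sample = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <p class="text-muted">runtime and genre</p>
    <p class="text-muted">movie description</p>
  </div>
</div>
"""

print(second_muted_paragraph(sample))
print(second_muted_paragraph("<div></div>"))
```

With the requests library, the same function could be fed the downloaded page, for example second_muted_paragraph(requests.get(url).text). In Python 3, print is a function and the urllib2 import is no longer available; the script above never used it anyway.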

Answered By - Bitto Bennichan

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
