Issue
I have the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

def Get_Top_List_BR(url):
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]
    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        url = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
        # set serial number as key to avoid duplication in any other category - especially title
        movies[url] = [url_string] + [i.text for i in items]
    movie_page = pd.DataFrame(movies).T  # transpose
    movie_page.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                          'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']
    return movie_page

df_test_BR = Get_Top_List_BR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
df_test_BR.head(10)
Problem: I am only getting the last row. Question: How can I fix it to return all the rows?
Solution
First, I'm not sure which Python version you are using, but the way you construct the BeautifulSoup object is incorrect, at least in my version: BeautifulSoup strongly recommends specifying a parser explicitly. This code:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
table = soup.find(id='table')
should be:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find(id='table')
Your actual issue is how you define url inside the for-loop. I managed to loop through all the elements, but the way you redefine url is specifically the problem: that chain of split calls returns an empty string for every row. That is why you only see the last item. On each iteration the loop computes url, but since url is always the empty string, the key already exists in movies, so each assignment overwrites the previous row's data and only the final row survives.
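You can see this without touching the site at all. The hrefs below are hypothetical examples in the same /release/rl.../ shape as Box Office Mojo's table links, not scraped values; running the original split chain on them shows every one collapsing to an empty string, so each dict assignment overwrites the last:

```python
# Hypothetical hrefs in the same shape as Box Office Mojo release links
hrefs = [
    "/release/rl1182631425/?ref_=bo_yld_table_1",
    "/release/rl2992538113/?ref_=bo_yld_table_2",
]

movies = {}
for url_string in hrefs:
    # the original split chain from the question
    key = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
    # the path starts with '/', so splitting on '/' puts '' first -> key is always ''
    print(repr(key))
    movies[key] = url_string  # same key '' every time, so this overwrites

print(movies)  # only the last href survives
```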
I'm not sure exactly how you wanted url defined, but this code does what you intend: it fetches all the movies, their names and href values, and returns the first 10. The only differences are in how the dictionary key and movies entry are defined, but be careful not to trip up on the url again.
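As a quick sanity check on the replacement split (again using a hypothetical href in the same shape as the table's links), url_string.split("/")[-2] picks out the rl... serial number because the query string, not the serial number, is the last element after the split:

```python
# Hypothetical href in the same shape as the table's release links
url_string = "/release/rl1182631425/?ref_=bo_yld_table_1"

# splitting on '/' gives ['', 'release', 'rl1182631425', '?ref_=bo_yld_table_1'],
# so the second-to-last element is the unique serial number
uid = url_string.split("/")[-2]
print(uid)  # rl1182631425
```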
Also, since you redefine url inside the for-loop to represent a unique ID, the name should reflect that: call it unique_id (or, as in this example, uid). I also included a print statement to demonstrate that it iterates through the entire loop and then returns the first 10 values.
def Get_Top_List_GR(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]
    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        uid = url_string.split("/")[-2]
        print("{0} - {1} - {2}".format(url, title, uid))
        # set serial number as key to avoid duplication in any other category - especially title
        movies[uid] = [url_string] + [i.text for i in items]
    movie_page = pd.DataFrame(movies).T  # transpose
    return movie_page

df_test_ = Get_Top_List_GR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
print(df_test_.head(10))
Answered By - astridonkey