Issue
I have the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

def Get_Top_List_BR(url):
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]
    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        url = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
        # set serial number as key to avoid duplication in any other category - especially title
        movies[url] = [url_string] + [i.text for i in items]
    movie_page = pd.DataFrame(movies).T  # transpose
    movie_page.columns = ['URL', 'Rank', 'Title', 'Genre', 'Budget', 'Running Time', 'Gross',
                          'Theaters', 'Total_Gross', 'Release_Date', 'Distributor', 'Estimated']
    return movie_page

df_test_BR = Get_Top_List_BR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
df_test_BR.head(10)
Problem: I am only getting the last row. Question: How can I fix it to return all the rows?
Solution
First, I'm not sure which Python version you are using, but the way you construct the BeautifulSoup object is incorrect, at least in my version: BeautifulSoup strongly recommends specifying a parser explicitly. This code:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page)
table = soup.find(id='table')
should be:
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find(id='table')
Your actual issue is how you define url inside the for-loop. I managed to loop through all the elements, but the way you redefine url is specifically the problem: that chain of split calls returns an empty string for every row. That is why you only see the last item. On each iteration the loop computes url, but since url is always the empty string, the key already exists in movies, so each assignment overwrites the previous row's data and only the final row survives.
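You can see this without touching the site at all. The hrefs below are hypothetical examples in the same /release/rl.../ shape as Box Office Mojo's table links, not scraped values; running the original split chain on them shows every one collapsing to an empty string, so each dict assignment overwrites the last:

```python
# Hypothetical hrefs in the same shape as Box Office Mojo release links
hrefs = [
    "/release/rl1182631425/?ref_=bo_yld_table_1",
    "/release/rl2992538113/?ref_=bo_yld_table_2",
]

movies = {}
for url_string in hrefs:
    # the original split chain from the question
    key = url_string.split('?', 1)[0].split('t', 4)[-1].split('/', 1)[0]
    # the path starts with '/', so splitting on '/' puts '' first -> key is always ''
    print(repr(key))
    movies[key] = url_string  # same key '' every time, so this overwrites

print(movies)  # only the last href survives
```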
I'm not sure exactly how you wanted url defined, but this code does what you intend: it fetches all the movies, their names and href values, and returns the first 10. The only differences are in how the dictionary key and movies entry are defined, but be careful not to trip up on the url again.
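As a quick sanity check on the replacement split (again using a hypothetical href in the same shape as the table's links), url_string.split("/")[-2] picks out the rl... serial number because the query string, not the serial number, is the last element after the split:

```python
# Hypothetical href in the same shape as the table's release links
url_string = "/release/rl1182631425/?ref_=bo_yld_table_1"

# splitting on '/' gives ['', 'release', 'rl1182631425', '?ref_=bo_yld_table_1'],
# so the second-to-last element is the unique serial number
uid = url_string.split("/")[-2]
print(uid)  # rl1182631425
```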
Also, since you redefine url inside the for-loop to represent a unique ID, the name should reflect that: call it unique_id (or, as in this example, uid). I also included a print statement to demonstrate that it iterates through the entire loop and then returns the first 10 values.
def Get_Top_List_GR(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find(id='table')
    rows = [row for row in table.find_all('tr')]
    movies = {}
    for row in rows[1:]:
        items = row.find_all('td')
        link = items[1].find('a')
        title, url_string = link.text, link['href']
        # split url string into unique movie serial number
        uid = url_string.split("/")[-2]
        print("{0} - {1} - {2}".format(url, title, uid))
        # set serial number as key to avoid duplication in any other category - especially title
        movies[uid] = [url_string] + [i.text for i in items]
    movie_page = pd.DataFrame(movies).T  # transpose
    return movie_page

df_test_ = Get_Top_List_GR('https://www.boxofficemojo.com/year/2019/?grossesOption=calendarGrosses&area=BR/')
print(df_test_.head(10))
Answered By - astridonkey