Issue
I'm trying to scrape all the movies and release dates on this Wikipedia page across multiple tables:
https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films
This is my code:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
res = requests.get(url).text
soup = BeautifulSoup(res, 'lxml')
for items in soup.find('table', class_='wikitable').find_all('tr')[1::1]:
    data = items.find_all(['th', 'td'])
    try:
        movie = data[0].i.a.text
    except IndexError:
        pass
    print("{}".format(movie))
However, I'm only getting the movie titles from the first 1930s–1940s table. What I'm hoping for is two columns like this:
"Snow White and the Seven Dwarfs" | "December 21, 1937"
"Pinocchio" | "February 7, 1940"
"Fantasia" | "November 13, 1940"
How would I get this?
Solution
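Since you expect two columns (and potentially a DataFrame), you can use read_html from pandas. It returns one DataFrame per table on the page, so concatenating them covers every table rather than just the first: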
# pip install pandas lxml
import pandas as pd

wiki_link = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
# Stack every table, keep the two wanted columns, drop rows with no title
df = (
    pd.concat(pd.read_html(wiki_link), ignore_index=True)
    [["Title", "Release date"]]
    .dropna(subset=["Title"])
)
Output:
print(df)
                               Title       Release date
0    Snow White and the Seven Dwarfs  December 21, 1937
1                          Pinocchio   February 7, 1940
2                           Fantasia  November 13, 1940
..                               ...                ...
618         Untitled Zootopia sequel                TBA
619                   World's Best ‡                TBA
620            Wouldn't It Be Nice ‡                TBA

[600 rows x 2 columns]
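The last index label above (620) is larger than the row count (600). That is most likely because dropna removes rows after concat has already numbered them, so the index keeps gaps. A minimal sketch of that behaviour, using small made-up frames in place of the real Wikipedia tables (the "legend" frame and its "Symbol" column are hypothetical stand-ins for a table without a "Title" column):

```python
import pandas as pd

# Hypothetical stand-ins for what pd.read_html might return for the page
films_a = pd.DataFrame({
    "Title": ["Snow White and the Seven Dwarfs", "Pinocchio"],
    "Release date": ["December 21, 1937", "February 7, 1940"],
})
legend = pd.DataFrame({"Symbol": ["‡"]})  # a table with no "Title" column
films_b = pd.DataFrame({
    "Title": ["Fantasia"],
    "Release date": ["November 13, 1940"],
})

# concat takes the union of the columns and fills missing cells with NaN
combined = pd.concat([films_a, legend, films_b], ignore_index=True)

# Rows that came from tables without a title are dropped, leaving an index gap
kept = combined[["Title", "Release date"]].dropna(subset=["Title"])
print(kept.index.tolist())

# Renumber from zero if a contiguous index is wanted
df_clean = kept.reset_index(drop=True)
print(df_clean)
```

Appending .reset_index(drop=True) to the chain in the answer should give a contiguous index in the same way.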
Answered By - Timeless
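If you would rather stay with BeautifulSoup, the reason the original loop stops after the first table is that soup.find returns only the first match, whereas soup.find_all returns every match. Below is a rough sketch of that variant, not part of the answer above. It assumes the title sits in the first cell of each row and the release date in the second, which may not hold for every table on the page (merged cells would shift the columns). The extract_films helper and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_films(html):
    """Collect (title, release date) pairs from every 'wikitable' in html.

    Assumes column 0 holds the title and column 1 the release date.
    """
    soup = BeautifulSoup(html, "html.parser")
    films = []
    # find_all walks every matching table; find would stop at the first
    for table in soup.find_all("table", class_="wikitable"):
        for row in table.find_all("tr")[1:]:  # skip the header row
            cells = row.find_all(["th", "td"])
            if len(cells) < 2:
                continue  # not a data row
            title = cells[0].get_text(strip=True)
            release = cells[1].get_text(strip=True)
            films.append((title, release))
    return films


# Tiny made-up page with two tables, mimicking the structure in the question
sample_html = """
<table class="wikitable">
  <tr><th>Title</th><th>Release date</th></tr>
  <tr><th><i><a href="#">Snow White and the Seven Dwarfs</a></i></th><td>December 21, 1937</td></tr>
</table>
<table class="wikitable">
  <tr><th>Title</th><th>Release date</th></tr>
  <tr><th><i><a href="#">Fantasia</a></i></th><td>November 13, 1940</td></tr>
</table>
"""

for title, release in extract_films(sample_html):
    print('"{}" | "{}"'.format(title, release))
```

For the real page, pass requests.get(url).text to extract_films instead of sample_html.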