Issue
As the title says, I'm scraping some information from Letterboxd and need help.
I already have a function that scrapes all the info I need (name, date, cast, etc.) from a URL like https://letterboxd.com/film/when-marnie-was-there/
I also want to scrape all the movies I've already watched (listed at https://letterboxd.com/gfac/films/diary/) and then pass each movie's URL to my other function.
But looking at the diary page in my browser's devtools, I can't find the complete movie URL. So I was wondering whether I can extract one of the two pieces of info highlighted in the screenshot. If so, I could then concatenate
"https://letterboxd.com/" + "film/when-marnie-was-there/"
and run my other function.
This is what I have so far:
import requests
from bs4 import BeautifulSoup

def teste(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # First diary entry heading, which wraps the link to the film
    elem = soup.find_all("h3", {"class": "headline-3 prettify"})[0]
    return elem

a = teste("https://letterboxd.com/gfac/films/diary/")
print(a)
<h3 class="headline-3 prettify"><a href="/gfac/film/when-marnie-was-there/">When Marnie Was There</a></h3>
Solution
You are on the right track: extract the href value with .get('href') and concatenate it with the base URL. To generate a list of URLs that you can iterate over and scrape, use a list comprehension:
diary_urls = ['https://letterboxd.com' + a.get('href').replace('/gfac','') for a in soup.select('h3>a[href]')]
or, more compactly, with slicing instead of .replace():
diary_urls = [base_url + a.get('href')[5:] for a in soup.select('h3>a[href]')]
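Note that slicing with [5:] only works because '/gfac' happens to be five characters long. A minimal sketch of a variant that should work for any username, assuming diary links always have the form /<username>/film/<slug>/ as in the output shown in the question. The helper name diary_href_to_film_url is hypothetical, not part of any library:

```python
from urllib.parse import urljoin

BASE_URL = "https://letterboxd.com"


def diary_href_to_film_url(href, username):
    """Turn a diary href such as '/gfac/film/the-fly/' into a film page URL.

    Assumes diary links look like '/<username>/film/<slug>/'.
    """
    prefix = f"/{username}"
    # Drop the leading '/<username>' only when it is actually there.
    if href.startswith(prefix + "/"):
        href = href[len(prefix):]
    return urljoin(BASE_URL, href)


print(diary_href_to_film_url("/gfac/film/when-marnie-was-there/", "gfac"))
# https://letterboxd.com/film/when-marnie-was-there/
```

Inside the list comprehension you would then call diary_href_to_film_url(a.get('href'), 'gfac') instead of slicing.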
Note: You could also use find_all(). I used select() and CSS selectors here for convenience and to select the elements more specifically: only <a> elements with an href attribute that are direct children of an <h3>.
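To illustrate what the selector matches, here is a small sketch that runs on an inline HTML snippet instead of the live page. The snippet is hypothetical and only modelled on the markup shown in the question; the find_all() version is one roughly equivalent way to write it, not the only one:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the diary markup shown in the question.
html = """
<h3 class="headline-3 prettify"><a href="/gfac/film/when-marnie-was-there/">When Marnie Was There</a></h3>
<h3 class="headline-3 prettify"><span><a href="/gfac/film/nested/">Nested link</a></span></h3>
<h3 class="headline-3 prettify"><a>No href</a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: <a> elements with an href that are direct children of an <h3>.
via_select = [a.get("href") for a in soup.select("h3>a[href]")]

# Roughly the same with find_all(): check only the direct children of each <h3>.
via_find_all = [
    a.get("href")
    for h3 in soup.find_all("h3")
    for a in h3.find_all("a", href=True, recursive=False)
]

print(via_select)
print(via_find_all)
```

Both lists contain only the first link: the nested link is not a direct child of its <h3>, and the last <a> has no href attribute.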
Example
import requests
from bs4 import BeautifulSoup

base_url = 'https://letterboxd.com'

# Fetch the diary page and build the list of film page URLs
r = requests.get(f'{base_url}/gfac/films/diary/')
soup = BeautifulSoup(r.content, "html.parser")
diary_urls = [base_url + a.get('href')[5:] for a in soup.select('h3>a[href]')]

data = []

# Only the first two films here, to keep the example short
for url in diary_urls[:2]:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    data.append({
        'title': soup.select_one('#film-page-wrapper h1').get_text(),
        'cast': soup.select_one('#tab-cast p').get_text(',', strip=True),
        'what ever': 'you like to scrape'
    })

print(data)
Output
[{'title': 'When Marnie Was There', 'cast': 'Sara Takatsuki,Kasumi Arimura,Nanako Matsushima,Susumu Terajima,Toshie Negishi,Ryôko Moriyama,Kazuko Yoshiyuki,Hitomi Kuroki,Hiroyuki Morisaki,Takuma Otoo,Hana Sugisaki,Bari Suzuki,Shigeyuki Totsugi,Ken Yasuda,Yo Oizumi,Yuko Kaida', 'what ever': 'you like to scrape'}, {'title': 'The Fly', 'cast': 'Jeff Goldblum,Geena Davis,John Getz,Joy Boushel,Leslie Carlson,George Chuvalo,Michael Copeman,David Cronenberg,Carol Lazare,Shawn Hewitt,Typhoon', 'what ever': 'you like to scrape'},...]
Answered By - HedgeHog