Issue
Here is my current code. I am not sure what I am doing wrong. Maybe I am not digging deep enough into the HTML and giving BeautifulSoup the right tags? At the moment, my code returns nothing.
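from urllib.request import urlopen
from bs4 import BeautifulSoup
# Fetch the static HTML of the video page (no JavaScript is executed).
html = urlopen("https://www.youtube.com/watch?v=5_zrHZdhaBU")
soup = BeautifulSoup(html, "html.parser")
# Look for the caption line seen in the browser inspector.
name_list = soup.find_all("div", {"id": "cp-2"})
for name in name_list:
    print(name.get_text())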
Here is the markup that I inspected. I'm trying to get Python to return "but it was untucked."
<div id="cp-2" class="caption-line" data-time="7.54"><div class="caption-line-time">0:07</div><div class="caption-line-text">but it was untucked.</div></div>
***Edit
The code can be found by clicking on "more" next to the share button. Then you click on transcripts and you will see all the text there.
Solution
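Oh yes, it's loaded via Ajax, so it is not in the HTML that urlopen fetches. Open the page, then open the Network tab of your browser's developer tools, sort requests by start time (latest requests first), and click the CC button on YouTube. You will see an api/timedtext request, and its response is XML.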
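A quick way to confirm that the element is missing from the fetched HTML (rather than being selected wrongly) is to check for the id directly. This is only a minimal sketch using the standard library: the `IdFinder` class and `has_element_with_id` helper are names made up for illustration, and the `initial` string is a hypothetical stand-in for the server response, not real YouTube output.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_element_with_id(html_text, target_id):
    """Return True if html_text contains a tag whose id equals target_id."""
    finder = IdFinder(target_id)
    finder.feed(html_text)
    return finder.found


# Markup as seen in the browser inspector, after the transcript has loaded.
rendered = (
    '<div id="cp-2" class="caption-line" data-time="7.54">'
    '<div class="caption-line-time">0:07</div>'
    '<div class="caption-line-text">but it was untucked.</div></div>'
)
# Hypothetical stand-in for the initial response, before any script runs.
initial = '<html><body><div id="player"></div></body></html>'

print(has_element_with_id(rendered, "cp-2"))
print(has_element_with_id(initial, "cp-2"))
```

If the same check on the text returned by urlopen comes back False, the selector is not the problem: the transcript simply has not been loaded into the page yet.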
Here is the full URL to the transcript:
I have no idea how this URL is generated, though. This requires investigation of complex YouTube scripts, etc.
EDIT: This answer helped me. You can omit most of these parameters and just use this URL:
https://www.youtube.com/api/timedtext?&v=5_zrHZdhaBU&lang=en
Or this in general:
https://www.youtube.com/api/timedtext?&v={video_id}&lang={language_code}
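For what it's worth, here is a rough sketch of how the general URL above could be used from Python. Several things are assumptions on my part: the helper names `build_timedtext_url` and `parse_timedtext` are made up, the `sample` XML (a `transcript` root with `text` elements carrying a `start` attribute) is only my guess at the response layout and should be compared with what the Network tab actually shows, and the endpoint itself may not behave the same way for every video. The URL is built without the stray `&` after the `?`, which should be equivalent.

```python
import xml.etree.ElementTree as ET
from html import unescape
from urllib.parse import urlencode


def build_timedtext_url(video_id, language_code="en"):
    """Build the timedtext URL from the answer for one video and language."""
    query = urlencode({"v": video_id, "lang": language_code})
    return "https://www.youtube.com/api/timedtext?" + query


def parse_timedtext(xml_text):
    """Return (start_seconds, caption_text) pairs from the XML response.

    Assumes each caption is a <text start="..."> element; adjust the tag
    and attribute names if the real response differs.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter("text"):
        start = float(node.get("start", "0"))
        # unescape() covers the case where the text is entity-escaped twice.
        lines.append((start, unescape(node.text or "")))
    return lines


# Assumed sample response, modelled on the caption line from the question.
sample = (
    '<transcript>'
    '<text start="7.54" dur="2.1">but it was untucked.</text>'
    '</transcript>'
)

url = build_timedtext_url("5_zrHZdhaBU")
print(url)
for start, text in parse_timedtext(sample):
    print(start, text)

# To fetch the real response instead of the sample (needs network access):
# from urllib.request import urlopen
# xml_text = urlopen(url).read().decode("utf-8")
# captions = parse_timedtext(xml_text)
```

Keeping the URL building and the parsing separate means the parser can be tried on a saved response first, before making any live requests.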
Answered By - Andrey Moiseev