Issue
YouTube's HTML uses custom elements such as yt-formatted-string. Their content cannot be read from the HTML alone without running YouTube's JavaScript, so BeautifulSoup4 appears to parse the page wrong.
Here's the code:
from bs4 import BeautifulSoup
import requests

url = "https://www.youtube.com/watch?v=S4E4yAktjug"
response = requests.get(url)
if response.status_code == 200:
    doc = BeautifulSoup(response.text, "html.parser")
    data_container = doc.find('div', {'id': 'info-container'})
    print(data_container.prettify())
It prints this:
<div id="info-container">
<div class="skeleton-light-border-bottom" id="primary-info">
<div class="text-shell skeleton-bg-color" id="title">
</div>
<div id="info">
<div class="text-shell skeleton-bg-color" id="count">
</div>
<div class="flex-1">
</div>
<div id="menu">
<div class="menu-button skeleton-bg-color">
</div>
<div class="menu-button skeleton-bg-color">
</div>
<div class="menu-button skeleton-bg-color">
</div>
<div class="menu-button skeleton-bg-color">
</div>
<div class="menu-button skeleton-bg-color">
</div>
</div>
</div>
</div>
<div class="skeleton-light-border-bottom" id="secondary-info">
<div id="top-row">
<div class="flex-1" id="video-owner">
<div class="skeleton-bg-color" id="channel-icon">
</div>
<div class="flex-1" id="upload-info">
<div class="text-shell skeleton-bg-color" id="owner-name">
</div>
<div class="text-shell skeleton-bg-color" id="published-date">
</div>
</div>
</div>
<div class="skeleton-bg-color" id="subscribe-button">
</div>
</div>
</div>
</div>
Edit: these are the expected values:
2.4M views, 1 year ago
So when I try to retrieve the view count, I get None. Is there a possible fix?
I tried going step by step: first getting the info container, then a child element, and then the view count inside it, but that failed with "'NoneType' object has no attribute 'find'".
I also tried listing all the spans and picking out the one containing the views, but that was inefficient, confusing, and it failed.
Solution
Ajeet Verma explained in a comment: "@Rayaankhan, JavaScript is involved and the requests library does not support it. That's why you get different HTML content. But you still get all the data inside one of the script tags, where it lies in a deeply nested JSON object that you'll need to parse."
As Ajeet's comment says, the requests library does not run JavaScript, so the HTML it returns is only the unrendered skeleton. I need to get the JavaScript separately and render the HTML with it.
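The approach from Ajeet's comment, parsing the JSON that is already embedded in a script tag, could look roughly like the sketch below. It is only a sketch under assumptions: the marker name ytInitialData and the key viewCount are guesses at what the page contains and are not confirmed anywhere in this post, and extract_embedded_json and find_first are hypothetical helper names. It uses only the standard library and works on the raw text returned by requests.

```python
import json


def extract_embedded_json(html, marker="ytInitialData"):
    """Return the first JSON object found after `marker` in `html`, or None.

    `marker` is an assumed variable name; check the page source for the real one.
    """
    decoder = json.JSONDecoder()
    pos = html.find(marker)
    while pos != -1:
        start = html.find("{", pos)
        if start == -1:
            return None
        try:
            # raw_decode parses one JSON value and ignores whatever follows it
            obj, _ = decoder.raw_decode(html, start)
            return obj
        except json.JSONDecodeError:
            # This occurrence was not followed by valid JSON; try the next one
            pos = html.find(marker, pos + len(marker))
    return None


def find_first(obj, key):
    """Depth-first search of nested dicts/lists for the first value stored under `key`."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None


# Usage with the code from the question (the key name is an assumption):
#   data = extract_embedded_json(response.text)
#   if data is not None:
#       print(find_first(data, "viewCount"))
```

Both helpers return None when nothing is found, so check the result before using it to avoid the same NoneType error as in the question. If the data is not in the page source at all, rendering the page in a real browser is the remaining option.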
Answered By - Rayaan khan