Issue
I am trying to scrape a list of YouTube videos and collect each video's description. However, my attempt fails and I do not understand why. Any help is much appreciated. (YouTube video in question: https://www.youtube.com/watch?v=57Tjvv_pCXg&t=55s)
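import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
# driver is an already-running Selenium WebDriver on the page that lists the videos
element_titles = driver.find_elements(By.ID, "video-title")
result = requests.get(element_titles[1].get_attribute("href"))
soup = BeautifulSoup(result.content, "html.parser")
description = str(soup.find("div", {"class": "style-scope yt-formatted-string"}))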
The resulting description is None.
Note: I understand that a YouTube API exists; however, you must pay for an API key, and I am not interested in doing so.
Solution
To extract the description you can use either Selenium or BeautifulSoup. The latter is faster; here is the code:
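import re
import requests
from bs4 import BeautifulSoup
# Download the watch page and parse it with the built-in HTML parser
soup = BeautifulSoup(requests.get('https://www.youtube.com/watch?v=57Tjvv_pCXg').content, 'html.parser')
pattern = re.compile(r'(?<=shortDescription":").*(?=","isCrawlable)')
# Turn the escaped \n sequences of the embedded JSON into real line breaks
description = pattern.findall(str(soup))[0].replace('\\n', '\n')
print(description)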
If you run print(soup.prettify()) and look for a part of the video description, say "know this is just my", you will see that the complete description is inside a big JSON structure:
...,"isOwnerViewing":false,"shortDescription":"Listen: https://quellechris360.bandcamp.com/album/deathfame\n\nQuelle Chris delivers what might be his most challengi...bla bla...ABSTRACT HIP HOP\n\n7/10\n\nY'all know this is just my opinion, right?","isCrawlable":true,"thumbnail":{...
In particular, the description is enclosed between shortDescription":" and ","isCrawlable, so we can use a regex to extract the substring between these two strings. The pattern that matches every character (.*) between the two strings, using a lookbehind for the first and a lookahead for the second, is (?<=shortDescription":").*(?=","isCrawlable)
Answered By - sound wave