Issue
I am trying to code a sort of YouTube downloader that takes a YouTube video URL and uses requests and BeautifulSoup to scrape the download link for that video from an online video downloader.
Website used - https://www.y2mate.com/
Approach: the website above has a nifty feature that allows the following: use https://www.youtubepp.com/ followed by the video link, say https://www.youtube.com/watch?v=dQw4w9WgXcQ.
Doing so takes you to the website with the YouTube video already entered in the search bar.
This allows using this kind of special link with the requests module to extract the required links.
Issues:
Inspecting the download link shows href like so (for 480p download):
<a href="javascript:void(0)" rel="nofollow" type="button" class="btn btn-success" data-toggle="modal" data-target="#progress" data-ftype="mp4" data-fquality="480"> <i class="glyphicon glyphicon-download-alt"></i> Download </a>
How do I extract the link from this href="javascript:void(0)"?
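For what it's worth, even though the href is only a JavaScript stub, the data-ftype and data-fquality attributes on that anchor are readable; a minimal offline sketch parsing the snippet above (assuming BeautifulSoup is installed):

```python
from bs4 import BeautifulSoup

# the anchor from the page, trimmed: the href is a JS stub, but the
# data-* attributes describe which stream the button requests
html = ('<a href="javascript:void(0)" rel="nofollow" type="button" '
        'class="btn btn-success" data-ftype="mp4" data-fquality="480"> Download </a>')
a = BeautifulSoup(html, "html.parser").find("a")
print(a["data-ftype"], a["data-fquality"])  # mp4 480
```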
I looked into this SO question, but it doesn't help me because I can't find the onClick attribute. Apart from this issue, I ran the following code to extract the HTML of the page:
from bs4 import BeautifulSoup
import requests

DOMAIN = "https://www.youtubepp.com/"
URL = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
download_url = DOMAIN + URL
params = {
    "hl": "en"  # I needed this because by default I was not getting English
}
res = requests.get(download_url, params=params)
soup = BeautifulSoup(res.text, "html.parser")
print(soup.prettify())
On inspecting the output, I find that the part which has the links I need is not even displayed. How do I then successfully extract the download link if it does not even appear after parsing with BeautifulSoup?
Alternate attempt:
Due to the first issue, I tried using a different website from which to extract the download link.
In this website, we can use https://www.ssyoutube.com/ followed by the watch?v=dQw4w9WgXcQ part of the video link (http://youtube.com/watch?v=dQw4w9WgXcQ).
This does solve the first issue, but the second issue occurs with this website as well.
EDIT 3 (After @Lima's response)
I tried the without-regex approach, but it gives the same error; the reason was apparent from your debug messages:
Jim Yosef - Link [NCS Release]
DEBUG:
DEBUG:  (<- need to be the value of var k__id)
1: {'quality': '1080p HFR', 'type': 'mp4'}
2: {'quality': '720p HFR', 'type': 'mp4'}
3: {'quality': '480', 'type': 'mp4'}
4: {'quality': '360', 'type': 'mp4'}
5: {'quality': '240p', 'type': 'mp4'}
6: {'quality': '144p', 'type': 'mp4'}
7: {'quality': '144p', 'type': '3gp'}
8: {'quality': '128', 'type': 'mp3'}
9: {'quality': '128', 'type': 'mp3'}
Select stream [1-9]: 3
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\sol2.py", line 71, in <module>
toDownload = download.attrs["href"]
AttributeError: 'NoneType' object has no attribute 'attrs'
So for some reason the .getText() method returns None (I believe it returned None because of that output), which tells me it was the same problem as last time: .getText() would return None and then the regex would not find any matches.
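That failure mode can be reproduced offline: when soup.find() matches nothing it returns None, and anything chained onto it raises AttributeError. A small sketch with a stand-in document:

```python
from bs4 import BeautifulSoup

# a document with no matching <script> tag, standing in for the real page
soup = BeautifulSoup("<div>no inline script here</div>", "html.parser")
tag = soup.find("script", {"type": "text/javascript"})
print(tag)  # None -> tag.getText() would raise AttributeError
text = tag.getText() if tag is not None else ""
print(repr(text))  # ''
```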
However, because you said we're looking for the var k__id value, I tried using my 'fix' to get the <script> tag and then used regex on that to get the value of var k__id.
myScript = soup.findAll("script", {"type": "text/javascript"})[0]  # instead of using the .find().getText() method
print("DEBUG:", myScript)  # this worked, I skipped it in the output because of clutter
getId = re.compile(r'(?<=var k__id = ")\w*(?=";)')
tmpID = getId.findall(myScript)[0]
print("DEBUG:", tmpID, "(<- need to be the value of var k__id)")
Unfortunately, it gave the following error:
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\main.py", line 62, in <module>
tmpID = getId.findall(myScript)[0]
File "C:\Users\rayya\AppData\Local\Programs\Python\Python39\lib\re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
BUT BUT BUT...
I printed type(myScript) and got <class 'bs4.element.Tag'>.
All I had to do now was myScript = str(myScript), and then everything worked like a charm. It was like music to my ears, but to my eyes. (Don't ask :) lol)
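A minimal reproduction of that Tag-vs-str issue, with a made-up k__id value standing in for the real script:

```python
import re
from bs4 import BeautifulSoup

# a stand-in <script> tag with a dummy k__id value
html = '<script type="text/javascript">var k__id = "abc123";</script>'
myScript = BeautifulSoup(html, "html.parser").find("script", {"type": "text/javascript"})
getId = re.compile(r'(?<=var k__id = ")\w*(?=";)')
# getId.findall(myScript) raises TypeError: expected string or bytes-like object,
# because myScript is a bs4.element.Tag; converting it to str first fixes that
tmpID = getId.findall(str(myScript))[0]
print(tmpID)  # abc123
```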
EDIT (After attempting @Lima 's solution)
I am getting the following error (I used the video id that you commented):
Youtube video id: 9iHM6X6uUH8
Jim Yosef - Link [NCS Release]
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\main.py", line 58, in <module>
tmpID = re.findall(getId, soup.find("script", {"type": "text/javascript"}).getText())[0]
IndexError: list index out of range
EDIT 2 (examining the error)
I noticed that the line tmpID = re.findall(getId, soup.find("script", {"type": "text/javascript"}).getText())[0] is looking for a "script" tag with type "text/javascript".
So after the declaration of soup, I printed soup.prettify(). Here is the output of that; you'll notice that line 273 is the thing we are looking for, but it was not being found for some reason.
I tried changing the line to tmpID = soup.findAll("script", {"type": "text/javascript"})[0], pretty much copying the syntax of the links variable declaration just below it. And it worked.
BUT...
It now gives a whole new error
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\main.py", line 91, in <module>
toDownload = download.attrs["href"]
AttributeError: 'NoneType' object has no attribute 'attrs'
So again, I printed soup.prettify(), but this time after the second declaration of soup. This is the output, and I have no idea how to proceed any further.
Solution
EDIT: I avoided regex:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib import parse
import requests, json#, re

Id = parse.quote(input("Youtube video id: "))  # Like: 9iHM6X6uUH8
res = requests.post("https://www.y2mate.com/mates/en115/analyze/ajax",
    headers={
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Alt-Used": "www.y2mate.com",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },
    data="url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D{}&q_auto=0&ajax=1".format(Id)
)
data = json.loads(res.content)
if not data['status'] == 'success':
    raise RuntimeError(f"data['status'] == {data['status']}")
soup = BeautifulSoup(data['result'], "html.parser")
if not soup.findChild().attrs['class'] == ['tabs', 'row']:
    raise FileNotFoundError("Video does not exist")
name = soup.find('div', {'class': ['caption', 'text-left']}).findChild('b').getText()
print(name)
myScript = soup.find('script', {'type': 'text/javascript'}).getText()
print('DEBUG:', myScript)
#getId = re.compile(r'(?<=var k__id = ")\w*(?=";)')
tmpID = myScript[70+len(name):70+24+len(name)]  #re.findall(getId, myScript)[0]
print('DEBUG:', tmpID, '(<- need to be the value of var k__id)')
links = soup.findAll('a', {'class': ["btn", "btn-success"]})
streams = []
for i, link in enumerate(links):
    stream = {
        'quality': link.attrs['data-fquality'],
        'type': link.attrs['data-ftype']
    }
    streams.append(stream)
    print(f'{i+1}: {stream}')
myStream = streams[int(input(f"Select stream [1-{len(streams)}]: ")) - 1]
res = requests.post("https://www.y2mate.com/mates/convert",
    headers={
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Alt-Used": "www.y2mate.com",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },
    data="type=youtube&_id={verify}&v_id={vid}&ajax=1&token=&ftype={type}&fquality={quality}" \
        .format(**myStream, verify=tmpID, vid=Id)
)
data = json.loads(res.content)
if not data['status'] == 'success':
    raise RuntimeError(f"data['status'] == {data['status']}")
soup = BeautifulSoup(data['result'], "html.parser")
download = soup.find('a', {'class': ['btn', 'btn-success', 'btn-file']})
toDownload = download.attrs['href']
print('Here is the download link:')
print(toDownload)
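As an aside, the fixed-offset slice myScript[70+len(name):70+24+len(name)] depends on the script's exact layout; splitting on the assignment is a sturdier way to pull out k__id. Shown here on a made-up script string standing in for the real myScript:

```python
# stand-in for myScript; only the k__id assignment matters here
script_text = 'var c_data = {}; var k__id = "0a1b2c3d4e5f";'
# take everything between the opening quote of the assignment and the next quote
k_id = script_text.split('var k__id = "', 1)[1].split('"', 1)[0]
print(k_id)  # 0a1b2c3d4e5f
```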
I found this out with Chrome DevTools:
- I opened the Network tab
- Pasted a YouTube URL
- Clicked a download button
- And I saw 2 POST requests (to which URL, with headers, data, and response)
Answered By - Lima