Issue
I am trying to code a sort of YouTube downloader that takes a YouTube video URL and uses requests and BeautifulSoup to scrape the download link for that video from an online video downloader.
Website used - https://www.y2mate.com/
Approach: the website above has a nifty feature that allows the following: use https://www.youtubepp.com/ followed by the video link, say https://www.youtube.com/watch?v=dQw4w9WgXcQ.
Doing so takes you to the website with the YouTube video already entered in the search bar.
This allows using this kind of special link with the requests module to extract the required links.
Issues:
Inspecting the download link shows href like so (for 480p download):
<a href="javascript:void(0)" rel="nofollow" type="button" class="btn btn-success" data-toggle="modal" data-target="#progress" data-ftype="mp4" data-fquality="480"> <i class="glyphicon glyphicon-download-alt"></i> Download </a>
How do I extract the link from this href="javascript:void(0)"?
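For what it's worth, even though the href is only a JavaScript stub, the data-ftype and data-fquality attributes on that anchor are readable; a minimal offline sketch parsing the snippet above (assuming BeautifulSoup is installed):

```python
from bs4 import BeautifulSoup

# the anchor from the page, trimmed: the href is a JS stub, but the
# data-* attributes describe which stream the button requests
html = ('<a href="javascript:void(0)" rel="nofollow" type="button" '
        'class="btn btn-success" data-ftype="mp4" data-fquality="480"> Download </a>')
a = BeautifulSoup(html, "html.parser").find("a")
print(a["data-ftype"], a["data-fquality"])  # mp4 480
```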
I looked into this SO question, but it doesn't help me because I can't find the onClick attribute. Apart from this issue, I ran the following code to extract the HTML of the page:
from bs4 import BeautifulSoup
import requests

DOMAIN = "https://www.youtubepp.com/"
URL = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
download_url = DOMAIN + URL
params = {
    "hl": "en"  # I needed this because by default I was not getting English
}
res = requests.get(download_url, params=params)
soup = BeautifulSoup(res.text, "html.parser")
print(soup.prettify())
On inspecting the output, I find that the part which has the links I need is not even displayed. How do I then successfully extract the download link if it does not even appear after parsing with BeautifulSoup?
Alternate attempt:
Due to the first issue, I tried using a different website from which to extract the download link.
In this website, we can use https://www.ssyoutube.com/ followed by the watch?v=dQw4w9WgXcQ part of the video link (http://youtube.com/watch?v=dQw4w9WgXcQ).
This does solve the first issue, but the second issue occurs with this website as well.
EDIT 3 (After @Lima's response)
I tried the without-regex approach, but it gives the same error; the reason was apparent from your debug messages:
Jim Yosef - Link [NCS Release]
DEBUG:
DEBUG:  (<- need to be the value of var k__id)
1: {'quality': '1080p HFR', 'type': 'mp4'}
2: {'quality': '720p HFR', 'type': 'mp4'}
3: {'quality': '480', 'type': 'mp4'}
4: {'quality': '360', 'type': 'mp4'}
5: {'quality': '240p', 'type': 'mp4'}
6: {'quality': '144p', 'type': 'mp4'}
7: {'quality': '144p', 'type': '3gp'}
8: {'quality': '128', 'type': 'mp3'}
9: {'quality': '128', 'type': 'mp3'}
Select stream [1-9]: 3
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\sol2.py", line 71, in <module>
toDownload = download.attrs["href"]
AttributeError: 'NoneType' object has no attribute 'attrs'
So for some reason the .getText() method returns None (I believe it returned None because of that output), which tells me it was the same problem as last time: .getText() would return None and then the regex would not find any matches.
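That failure mode can be reproduced offline: when soup.find() matches nothing it returns None, and anything chained onto it raises AttributeError. A small sketch with a stand-in document:

```python
from bs4 import BeautifulSoup

# a document with no matching <script> tag, standing in for the real page
soup = BeautifulSoup("<div>no inline script here</div>", "html.parser")
tag = soup.find("script", {"type": "text/javascript"})
print(tag)  # None -> tag.getText() would raise AttributeError
text = tag.getText() if tag is not None else ""
print(repr(text))  # ''
```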
However, because you said we're looking for the var k__id value, I tried using my 'fix' to get the <script> tag and then used regex on that to get the value of var k__id.
myScript = soup.findAll("script", {"type": "text/javascript"})[0]  # instead of using the .find().getText() method
print("DEBUG:", myScript)  # this worked, I skipped it in the output because of clutter
getId = re.compile(r'(?<=var k__id = ")\w*(?=";)')
tmpID = getId.findall(myScript)[0]
print("DEBUG:", tmpID, "(<- need to be the value of var k__id)")
Unfortunately, it gave the following error:
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\main.py", line 62, in <module>
tmpID = getId.findall(myScript)[0]
File "C:\Users\rayya\AppData\Local\Programs\Python\Python39\lib\re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
BUT BUT BUT...
I printed type(myScript) and got <class 'bs4.element.Tag'>.
All I had to do now was myScript = str(myScript), and then everything worked like a charm. It was like music to my ears, but to my eyes. (Don't ask :) lol)
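A minimal reproduction of that Tag-vs-str issue, with a made-up k__id value standing in for the real script:

```python
import re
from bs4 import BeautifulSoup

# a stand-in <script> tag with a dummy k__id value
html = '<script type="text/javascript">var k__id = "abc123";</script>'
myScript = BeautifulSoup(html, "html.parser").find("script", {"type": "text/javascript"})
getId = re.compile(r'(?<=var k__id = ")\w*(?=";)')
# getId.findall(myScript) raises TypeError: expected string or bytes-like object,
# because myScript is a bs4.element.Tag; converting it to str first fixes that
tmpID = getId.findall(str(myScript))[0]
print(tmpID)  # abc123
```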
EDIT (After attempting @Lima 's solution)
I am getting the following error (I used the video id that you commented):
Youtube video id: 9iHM6X6uUH8
Jim Yosef - Link [NCS Release]
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\main.py", line 58, in <module>
tmpID = re.findall(getId, soup.find("script", {"type": "text/javascript"}).getText())[0]
IndexError: list index out of range
EDIT 2 (examining the error)
I noticed that the line tmpID = re.findall(getId, soup.find("script", {"type": "text/javascript"}).getText())[0] is looking for a "script" tag with type "text/javascript".
So after the declaration of soup, I printed soup.prettify(). Here is the output of that; you'll notice that line 273 is the thing we are looking for, but it was not being found for some reason.
I tried changing the line to tmpID = soup.findAll("script", {"type": "text/javascript"})[0], pretty much copying the syntax of the links variable declaration just below it. And it worked.
BUT...
It now gives a whole new error
Traceback (most recent call last):
File "c:\Users\rayya\Other\Web_Scraping\Youtube Download\main.py", line 91, in <module>
toDownload = download.attrs["href"]
AttributeError: 'NoneType' object has no attribute 'attrs'
So again, I printed soup.prettify(), but this time after the second declaration of soup. This is the output, and I have no idea how to proceed any further.
Solution
EDIT: I avoided regex:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib import parse
import requests, json#, re

Id = parse.quote(input("Youtube video id: "))  # Like: 9iHM6X6uUH8
res = requests.post("https://www.y2mate.com/mates/en115/analyze/ajax",
    headers={
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Alt-Used": "www.y2mate.com",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },
    data="url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D{}&q_auto=0&ajax=1".format(Id)
)
data = json.loads(res.content)
if not data['status'] == 'success':
    raise RuntimeError(f"data['status'] == {data['status']}")
soup = BeautifulSoup(data['result'], "html.parser")
if not soup.findChild().attrs['class'] == ['tabs', 'row']:
    raise FileNotFoundError("Video does not exist")
name = soup.find('div', {'class': ['caption', 'text-left']}).findChild('b').getText()
print(name)
myScript = soup.find('script', {'type': 'text/javascript'}).getText()
print('DEBUG:', myScript)
#getId = re.compile(r'(?<=var k__id = ")\w*(?=";)')
tmpID = myScript[70+len(name):70+24+len(name)]  #re.findall(getId, myScript)[0]
print('DEBUG:', tmpID, '(<- need to be the value of var k__id)')
links = soup.findAll('a', {'class': ["btn", "btn-success"]})
streams = []
for i, link in enumerate(links):
    stream = {
        'quality': link.attrs['data-fquality'],
        'type': link.attrs['data-ftype']
    }
    streams.append(stream)
    print(f'{i+1}: {stream}')
myStream = streams[int(input(f"Select stream [1-{len(streams)}]: ")) - 1]
res = requests.post("https://www.y2mate.com/mates/convert",
    headers={
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "Alt-Used": "www.y2mate.com",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin"
    },
    data="type=youtube&_id={verify}&v_id={vid}&ajax=1&token=&ftype={type}&fquality={quality}" \
        .format(**myStream, verify=tmpID, vid=Id)
)
data = json.loads(res.content)
if not data['status'] == 'success':
    raise RuntimeError(f"data['status'] == {data['status']}")
soup = BeautifulSoup(data['result'], "html.parser")
download = soup.find('a', {'class': ['btn', 'btn-success', 'btn-file']})
toDownload = download.attrs['href']
print('Here is the download link:')
print(toDownload)
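As an aside, the fixed-offset slice myScript[70+len(name):70+24+len(name)] depends on the script's exact layout; splitting on the assignment is a sturdier way to pull out k__id. Shown here on a made-up script string standing in for the real myScript:

```python
# stand-in for myScript; only the k__id assignment matters here
script_text = 'var c_data = {}; var k__id = "0a1b2c3d4e5f";'
# take everything between the opening quote of the assignment and the next quote
k_id = script_text.split('var k__id = "', 1)[1].split('"', 1)[0]
print(k_id)  # 0a1b2c3d4e5f
```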
I found this out with Chrome DevTools:
- I opened the Network tab
- Pasted a YouTube URL
- Clicked a download button
- And I saw 2 POST requests (to which URL, with headers, data, and response)
Answered By - Lima