Issue
Concretely I want to extract videos from this website: equidia. The first problem is when I launch scrapy shell https://www.equidia.fr/courses/2022-07-31/R1/C1 -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
. I inspected with view(response)
and notice the container of the video exists. I can click on play but there is no flow, so no video to extract. I guess the spider cannot be more able than I do in the web browser during inspection.
UPDATE:
Thanks to palpitus_on_fire I checked more the XHR
type requests while I only concentrated on media
type.
The XHR request I am interested about is: https://equidia-vodce-p-api.hexaglobe.net/video/info/20220731_35186008_00000HR/01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw/0b94eb69d04aa18ca33ba5e8767b1a85264c35b6acefc8bfbd7eabb5e259cf2c/1659574478
It returns a response json file:
{
"mp4": {
"240": "https://private4-stream-ovp.vide.io/dl/0bee7ff9bce79419fb27bac012c5d090/62e904dc/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0",
"360": "https://private3-stream-ovp.vide.io/dl/7c53dc70ae5cc46d7e2c9d43e3dfc685/62e904dc/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0",
"480": "https://private3-stream-ovp.vide.io/dl/4261c5bd7b48595535710e1dc34e8f93/62e904dc/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0",
"540": "https://private3-stream-ovp.vide.io/dl/d40592e39e27931f20245edde9f30152/62e904dc/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0",
"720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"
}
"hls": "https://private3-stream-ovp.vide.io/dl/6abf08b12203c84390a01fc4de65a4d5/62e904dc/equidia_2752987/videos/hls/63/03/S506303/V556523/playlist.m3u8?v=1",
"master": "https://private8-stream-ovp.vide.io/dl/6830b3a5039eef57f844bb08cd252584/62e904dc/equidia_2752987/videos/master/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523.mp4?v=0"
}
"720": "https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0"
contains the video and the resolution I want.
Two main problems now:
- How to obtain
XHR
request and precisely the one starting withequidia-vodce-p-api.hexaglobe.net
? The popular solutions are to use webtesters likeselenium
orplaywright
but I know none of them. - How to integrate it in a scrapy spider ?
If it is possible I would like to get a code of this form:
class VideoPageScraper(scrapy.Spider):
def __init__(self):
self.start_urls = ['https://www.equidia.fr/courses/2022-07-31/R1/C1']
def parse(self, response):
# code to catch the XHR request desired
urlvideo = 'https://private8-stream-ovp.vide.io/dl/9cb5a24a43a8d22d27e73b164020894d/62e904dc/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'
yield scrapy.Request(url=urlvideo, callback=self.obtainvideo)
def obtainvideo(self, response):
with #open('/home/avy/binary.mp4', returns'wb') as wfile:
wfile.write(response.body)
UPDATES
POSSIBILITY TO GENERATE XHR'S URL ?
Looking at the developement tool of Firefox I am trying to get how the desired XHR request is done.
The script main.d685b66f8277bab3376b.js produce the elements to make the XHR's url (at line 1761 as far as I understand).
Object { endpoint: "https://equidia-vodce-p-api.hexaglobe.net/video/info", publicKey: "01NdvIX6xNbihxxc3U52RDBX9novEE47lA1ZmEDw", hash: "c79502b086116f30262dfbf7675225dc1c8bfb4c144df09da9a34b9106fccd39", expiresAt: "1659608062" }
Then https://www.equidia.fr/polyfills.de328f7d13aac150968d.js
initiates a fecth with the XHR's url.
Do you think there is a way to get a generated object based on the script as shown above in order to make the desired XHR request ?
APPENDICE WITH PLAYWRIGHT TRY (headless browser)
I found a library that could be able to obtain the XHR
requests I am looking for. playwright
is a webtester and it is compatible with scrapy
with scrapy-playwright
.
Inspired by those examples, I tried the following code:
from playwright.sync_api import sync_playwright
url = 'https://www.equidia.fr/courses/2022-07-31/R1/C1'
with sync_playwright() as p:
def handle_response(response):
# the endpoint we are insterested in
if ("equidia-vodce-p-api" in response.url):
item = response.json()['mp4']
print(item)
browser = p.chromium.launch()
page = browser.new_page()
page.on("response", handle_response)
page.goto(url, wait_until="networkidle", timeout=900000)
page.context.close()
browser.close()
Exception in callback SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104
handle: <Handle SyncBase._sync.<locals>.callback(<Task finishe...t responses')>) at /home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py:104>
Traceback (most recent call last):
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 105, in callback
g_self.switch()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 124, in <lambda>
lambda params: self._on_response(
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_browser_context.py", line 396, in _on_response
page.emit(Page.Events.Response, response)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 176, in emit
handled = self._call_handlers(event, args, kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 154, in _call_handlers
self._emit_run(f, args, kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 50, in _emit_run
self.emit("error", exc)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 179, in emit
self._emit_handle_potential_error(event, args[0] if args else None)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/base.py", line 139, in _emit_handle_potential_error
raise error
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/pyee/asyncio.py", line 48, in _emit_run
coro: Any = f(*args, **kwargs)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_impl_to_api_mapping.py", line 88, in wrapper_func
return handler(
File "<input>", line 7, in handle_response
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 601, in json
self._sync("response.json", self._impl_obj.json())
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
return task.result()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 362, in json
return json.loads(await self.text())
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 358, in text
content = await self.body()
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_network.py", line 354, in body
binary = await self._channel.send("body")
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
return await self.inner_send(method, params, False)
File "/home/avy/anaconda3/envs/Turf/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.Error: Response body is unavailable for redirect responses
{'240': 'https://private4-stream-ovp.vide.io/dl/ec808fddd1b10659cb7d3bfeb7cba591/62e9abc8/equidia_2752987/videos/240p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-240p_full_mp4.mp4?v=0', '360': 'https://private3-stream-ovp.vide.io/dl/7c5cc10f2bb466fd85d7109c698500d3/62e9abc8/equidia_2752987/videos/360p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-360p_full_mp4.mp4?v=0', '480': 'https://private3-stream-ovp.vide.io/dl/c65277c3e385fb825e0ea4f5c1447d00/62e9abc8/equidia_2752987/videos/480p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-480p_full_mp4.mp4?v=0', '540': 'https://private3-stream-ovp.vide.io/dl/036f88b63a1157ce51bf68a7e60791cc/62e9abc8/equidia_2752987/videos/540p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-540p_full_mp4.mp4?v=0', '720': 'https://private8-stream-ovp.vide.io/dl/459a1e569da63e69031d1ecbf1f87ed0/62e9abc8/equidia_2752987/videos/720p_full_mp4/63/03/S506303/V556523/20220731_35186008_00000hr-s506303-v556523-720p_full_mp4.mp4?v=0'}
I managed to obtain the piece I wanted but with a lot of exceptions that I have no idea what's wrong.
Solution
Here is a fully working solution by using scrapy-playwright
, I got the idea the following issue 61 and the users profile @lime-n.
We download the xhr
requests sent and store these into a dict with both the playwright
tools and scrapy-playwright
. I include the playwright_page_event_handlers
to integrate playwright
tools for this.
Depending on the video size, it will take a few minutes to download.
main.py
import scrapy
from playwright.async_api import Response as PlaywrightResponse
import jsonlines
import pandas as pd
import json
from video.items import videoItem
headers = {
'authority': 'api.equidia.fr',
'accept': 'application/json, text/plain, */*',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'content-type': 'application/json',
'origin': 'https://www.equidia.fr',
'referer': 'https://www.equidia.fr/',
'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"macOS"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-site',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
class videoDownloadSpider(scrapy.Spider):
name = 'video'
start_urls = ['https://www.equidia.fr/courses/2022-07-31/R1/C1']
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(
url,
#callback = self.parse,
meta = {
'playwright':True,
'playwright_page_event_handlers':{
"response": "handle_response",
}
}
)
async def handle_response(self, response: PlaywrightResponse) -> None:
"""
We can grab the post data with response.request.post - there are three different types for different needs.
The method below helps grab those resource types of 'xhr' and 'fetch' until I can work out how to only send these to the download request.
"""
self.logger.info(f'test the log of data: {response.request.resource_type, response.request.url, response.request.method}')
jl_file = "videos.jl"
data = {}
if response.request.resource_type == "xhr":
if response.request.method == "GET":
if 'videos' in response.request.url:
data['resource_type']=response.request.resource_type,
data['request_url']=response.request.url,
data['method']=response.request.method
with jsonlines.open(jl_file, mode='a') as writer:
writer.write(data)
def parse(self, response):
video = pd.read_json('videos.jl', lines=True)
print('KEST: %s' % video['request_url'][0][0])
yield scrapy.FormRequest(
url = video['request_url'][0][0],
headers=headers,
callback = self.parse_video_json
)
def parse_video_json(self, response):
another_url = response.json().get('video_url')
yield scrapy.FormRequest(
another_url,
headers=headers,
callback=self.extract_videos
)
def extract_videos(self, response):
videos = response.json().get('mp4')
for keys, vids in videos.items():
loader = videoItem()
loader['video'] = [vids]
yield loader
items.py
class videoItem(scrapy.Item):
video = scrapy.Field()
pipelines.py
class downloadFilesPipeline(FilesPipeline):
def file_path(self, request, response=None, info=None, item=None):
file = request.url.split('/')[-1]
video_file = f"{file}.mp4"
return video_file
settings.py
from pathlib import Path
import os
BASE_DIR = Path(__file__).resolve().parent.parent
FILES_STORE = os.path.join(BASE_DIR, 'videos')
FILES_URLS_FIELD = 'video'
FILES_RESULT_FIELD = 'results'
DOWNLOAD_HANDLERS = {
"http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
"https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ITEM_PIPELINES = {
'video.pipelines.downloadFilesPipeline': 150,
}
Answered By - Dollar Tune-bill
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.