Issue
I am working on an LLM project on Google Colab using a V100 GPU and High-RAM mode, and these are my dependencies:
git+https://github.com/pyannote/pyannote-audio
git+https://github.com/huggingface/[email protected]
openai==0.28
ffmpeg-python
pandas==1.5.0
tokenizers==0.14
torch==2.1.1
torchaudio==2.1.1
tqdm==4.64.1
EasyNMT==2.0.2
psutil==5.9.2
requests
pydub
docxtpl
faster-whisper==0.10.0
git+https://github.com/openai/whisper.git
Here is everything I import:
from faster_whisper import WhisperModel
from datetime import datetime, timedelta
from time import time
from pathlib import Path
import pandas as pd
import os
from pydub import AudioSegment
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import requests
import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import wave
import contextlib
import psutil
import openai
from codecs import decode
from docxtpl import DocxTemplate
I used to use the latest versions of torch and torchaudio, but they got an update yesterday (15 December 2023, when v2.1.2 was released). I assumed the error I was getting was caused by that update, so I pinned them to the version my code was working with two days ago (v2.1.1). That did not fix it.
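For context, here is a minimal diagnostic sketch (not part of my original notebook) that prints what the Colab runtime actually has installed, so the environment can be ruled in or out:

from importlib.metadata import version
import torch, torchaudio

print("torch:", torch.__version__, "| built for CUDA", torch.version.cuda)
print("torchaudio:", torchaudio.__version__)
print("faster-whisper:", version("faster-whisper"))
print("ctranslate2:", version("ctranslate2"))
# The driver/CUDA version the Colab VM itself exposes (IPython shell escape):
!nvidia-smi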
Everything was working two days ago and I haven't changed anything in my notebook. The only thing that may have changed is the dependencies, but pinning the prior versions did not fix the problem. Here is the code snippet that throws the error:
def EETDT(audio_path, whisper_model, num_speakers, output_name="diarization_result", selected_source_lang="eng", transcript=None):
    """
    Uses Whisper to separate audio into segments and generate a transcript for each segment.
    Speech recognition is based on models from OpenAI Whisper https://github.com/openai/whisper
    Speaker diarization model and pipeline from https://github.com/pyannote/pyannote-audio
    audio_path : str -> path to wav file
    whisper_model : str -> small/medium/large/large-v2/large-v3
    num_speakers : int -> number of speakers in audio (0 to let the function determine it)
    output_name : str -> desired name of the output file
    selected_source_lang : str -> language code
    """
    audio_name = audio_path.split("/")[-1].split(".")[0]
    model = WhisperModel(whisper_model, compute_type="int8")
    time_start = time()
    if audio_path is None:
        raise ValueError("Error: no audio input")
    print("Input file:", audio_path)
    if not audio_path.endswith(".wav"):
        print("Submitted audio isn't in wav format. Starting conversion...")
        audio = AudioSegment.from_file(audio_path)
        audio_suffix = audio_path.split(".")[-1]
        new_path = audio_path.replace(audio_suffix, "wav")
        audio.export(new_path, format="wav")
        audio_path = new_path
        print("Converted to wav:", new_path)
    try:
        # Get duration
        with contextlib.closing(wave.open(audio_path, 'r')) as f:
            frames = f.getnframes()
            rate = f.getframerate()
            duration = frames / float(rate)
        if duration < 30:
            raise ValueError(f"Audio has to be longer than 30 seconds. Current: {duration}")
        print(f"Duration of audio file: {duration}")
        # Transcribe audio
        options = dict(language=selected_source_lang, beam_size=5, best_of=5)
        transcribe_options = dict(task="transcribe", **options)
        segments_raw, info = model.transcribe(audio_path, **transcribe_options)
        # Convert back to original openai format
        segments = []
        i = 0
        full_transcript = list()
        if not isinstance(transcript, pd.DataFrame):
            for segment_chunk in segments_raw:  # <-- THROWS ERROR
                chunk = {}
                chunk["start"] = segment_chunk.start
                chunk["end"] = segment_chunk.end
                chunk["text"] = segment_chunk.text
                full_transcript.append(segment_chunk.text)
                segments.append(chunk)
                i += 1
            full_transcript = "".join(full_transcript)
            print("Transcription done with faster-whisper")
        else:
            for i in range(len(transcript)):
                full_transcript.append(transcript["text"].iloc[i])
            full_transcript = "".join(full_transcript)
            print("You inputted pre-transcribed audio")
    except Exception as e:
        raise RuntimeError("Error converting video to audio")
The code never leaves the try block; the error is raised on the first iteration over segments_raw (faster-whisper only runs the model once the returned generator is consumed) and then gets replaced by the generic RuntimeError in the except block.
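One thing that makes this harder to debug is that my except block hides the underlying exception. A chained re-raise like the sketch below (not what I currently run) would keep the real error visible in the traceback:

    except Exception as e:
        # Chaining with "from e" keeps the original error (e.g. CTranslate2
        # complaining about a missing CUDA library) visible in the traceback
        # instead of masking it behind a generic message.
        raise RuntimeError("Error while transcribing audio") from e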
Solution
I'm having the same issue on Google Colab today while trying to use faster-whisper. This Whisper implementation still requires CUDA 11 and doesn't work with CUDA 12.
I took a look inside the Colab instance, and it has indeed switched to CUDA 12, which means faster-whisper can't run because its CUDA 11 dependencies are missing.
In case you want to try to get it working with CUDA 12, it should be doable by rebuilding CTranslate2 from source. Here's a reference issue about this problem: https://github.com/OpenNMT/CTranslate2/issues/1250
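In the meantime, a possible stopgap, assuming the failure really is a missing CUDA 11 runtime library, is to run faster-whisper on CPU so CTranslate2 never touches the GPU stack. A sketch (model name and file path are placeholders, not values from the question):

from faster_whisper import WhisperModel

# CPU fallback: slower, but avoids the CUDA 11 vs. 12 mismatch entirely.
model = WhisperModel("large-v2", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", language="en", beam_size=5)
for segment in segments:  # transcription actually runs while iterating
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")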
Answered By - DD3Boh