Issue
I am using PyTorch in an audio deep learning project. I am using the torchaudio.load
method to load the waveform and sample rate. Now, my question is, is the waveform considered the "raw" audio data? Is it the PCM data? If not, then how can I get PCM data from .ogg
format?
Solution
Yes, it's raw data.
For the explanation, read below. If you already know sampling theory and how sound is generated, skip to the last section.
PCM
PCM (pulse-code modulation) is a fancy name for the process by which a continuous-time wave is represented inside a computer. You can learn more in any introductory course or book on digital signal processing, such as chapter 3 of The Scientist and Engineer's Guide to Digital Signal Processing.
Briefly, a computer can only represent finite quantities, so you need to take discrete samples in time (sampling), each stored at one of a finite set of amplitude levels (quantization).
When you load any audio file, this process has already been done for you.
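To make the two steps concrete, here is a minimal sketch of sampling and quantization with NumPy. The function name sample_and_quantize, the 440 Hz tone, and the choice of 16-bit levels are illustrative assumptions, not something taken from torchaudio.

```python
import numpy as np

def sample_and_quantize(freq_hz, sample_rate, duration_s):
    """Sample a sine wave and quantize it to 16-bit signed integers."""
    n_samples = int(round(sample_rate * duration_s))
    # Sampling: evaluate the continuous wave only at instants n / sample_rate.
    t = np.arange(n_samples) / sample_rate
    wave = np.sin(2 * np.pi * freq_hz * t)   # amplitudes in [-1, 1]
    # Quantization: map each amplitude to the nearest of a finite set of
    # integer levels (here -32767..32767, i.e. 16 bits per sample).
    pcm = np.round(wave * 32767).astype(np.int16)
    return t, pcm

# 10 ms of a 440 Hz tone at 16 kHz (hypothetical values for illustration)
t, pcm = sample_and_quantize(440.0, 16000, 0.01)
print(len(pcm), pcm.dtype)
```

The resulting integer array is what PCM data looks like: one amplitude value per sampling instant.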
RAW DATA
If you connect a speaker and play a wave, the membrane will oscillate following the amplitude of that wave at every instant. This is the "raw" audio: a signal that contains the amplitude at each time instant. If you can "see" the wave changing with no discontinuity from left to right when plotting your data, it is very likely a raw vector.
What is non-raw data, then? A compression algorithm transforms the input vector with some mathematical function so that it occupies less space, but the result is no longer understandable just by looking at it, because the samples no longer represent an amplitude over time. If you played the compressed data through a speaker, you wouldn't get the original sound, only noise.
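A rough illustration of the difference, using only the Python standard library. Note the assumption: zlib is a generic lossless compressor used here as a stand-in, not an audio codec such as the one inside an .ogg file, and the helper sine_pcm16 is hypothetical. The point is only that compressed bytes, read as if they were samples, no longer trace a wave.

```python
import array
import math
import zlib

SAMPLE_RATE = 16000  # Hz, assumed for this illustration

def sine_pcm16(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Return a sine wave as an array of 16-bit signed PCM samples."""
    return array.array(
        "h",
        (round(32767 * math.sin(2 * math.pi * freq_hz * n / sample_rate))
         for n in range(n_samples)),
    )

raw = sine_pcm16(400.0, SAMPLE_RATE)   # one second of raw samples
raw_bytes = raw.tobytes()
packed = zlib.compress(raw_bytes)      # generic compressor, NOT an audio codec

# Reinterpret the compressed bytes as if they were 16-bit samples.
usable = len(packed) - len(packed) % 2
garbage = array.array("h", packed[:usable])

print(len(raw_bytes), len(packed))     # the compressed version is smaller
print(list(raw[:5]))                   # amplitudes that follow the wave
print(list(garbage[:5]))               # values with no wave shape

# Decompressing (decoding) gives the raw samples back.
restored = array.array("h", zlib.decompress(packed))
```

Decoding is the step that turns the compressed representation back into amplitudes over time.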
PyTorch
In the example you provided from the PyTorch documentation, we can clearly see that the plot represents raw data, sampled at 16 kHz.
One could still suspect that torchaudio.load returns some sort of compressed object and that the raw data is only generated and plotted by plot_waveform. To exclude this, look at the numbers: the waveform variable is 54400 samples long and sampled at 16 kHz, so it represents 54400 * (1/16000) seconds, which is exactly 3.4 s.
The plot shows 3.4 seconds, which tells us that the variable waveform returned by the load function is the raw data.
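The same check can be written in a few lines, together with a sketch of how to get integer PCM values out of the loaded samples. This assumes the waveform holds floating-point samples in [-1, 1] with shape (channels, samples), which I believe is what torchaudio.load returns by default. A NumPy array stands in for the tensor here (a real tensor can be converted with waveform.numpy()), and the helper names are hypothetical.

```python
import numpy as np

def duration_seconds(num_samples, sample_rate):
    """Length of the signal in seconds: samples times the sampling period."""
    return num_samples / sample_rate

def float_to_pcm16(waveform):
    """Convert float samples in [-1, 1] to 16-bit signed PCM integers."""
    clipped = np.clip(waveform, -1.0, 1.0)   # guard against out-of-range values
    return np.round(clipped * 32767).astype(np.int16)

# Stand-in for the loaded data: one channel, 54400 samples at 16 kHz.
waveform = np.zeros((1, 54400), dtype=np.float32)
sample_rate = 16000

print(duration_seconds(waveform.shape[1], sample_rate))  # 3.4
pcm16 = float_to_pcm16(waveform)
print(pcm16.shape, pcm16.dtype)
```

So even for an .ogg file, the decoding has already happened by the time you hold the waveform. Scaling to integers only changes the number format, not the nature of the data.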
Answered By - Fra93