Issue
I'm using the following subprocess call to use a command line tool. The output of the command line tool isn't printed in one go, it prints immediately on the command line, it generates over multiple lines over a period of time. The tool is bs1770gain and the command would be "path\to\bs1770gain.exe" "-i" "\path\to\audiofile.wav", By using the --loglevel parameter you can include more data but you cannot remove the progressive results being written to stdout.
I need stdout to return a human readable string (hence the stdout_formatted operation):
with subprocess.Popen(list_of_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
stdout, stderr = proc.communicate()
stdout_formatted = stdout.decode('UTF-8')
stderr_formatted = stderr.decode('UTF-8')
However I can only view the variable as a human readable string if I print it e.g.
In [23]: print(stdout_formatted )
nalyzing ... [1/2] "filename.wav":
integrated: -2.73 LUFS / -20.27 LU [2/2]
"filename2.wav":
integrated: -4.47 LUFS / -18.53 LU
[ALBUM]:
integrated: -3.52 LUFS / -19.48 LU done.
In [24]: stdout_formatted
Out[24]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\.......
In [6]: stdout
Out[6]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\......
In [4]: type(stdout)
Out[4]: bytes
In [5]: type(stdout_formatted)
Out[5]: str
If you look carefully, the human readable chars are in the string (the first word is "analyzing"
I guessed that the stdout value needs decoding/encoding so I tried different ways:
stdout_formatted.encode("ascii")
Out[18]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g
stdout_formatted.encode("utf-8")
Out[17]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("utf-8")
Out[15]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
stdout.decode("ascii")
Out[14]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
bytes(stdout).decode("ascii")
Out[13]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\
I used a library called chardet to check the encoding of stdout:
import chardet
chardet.detect(stdout)
Out[26]: {'confidence': 1.0, 'encoding': 'ascii', 'language': ''}
I'm working on Windows 10 and have am using python 3.6 (the anaconda package and it's integrated Spyder IDE).
I'm kind of clutching at straws now - is it possible to capture what is displayed in the console when print is called in a variable or remove the unwanted bytecode in the stdout string?
Solution
You don't have UTF-8 data. You have UTF-16 data. UTF-16 uses two bytes for every character; characters in the ASCII and Latin-1 ranges (such as a
), still use 2 bytes, but one of those bytes is always a \x00
NUL byte.
Because UTF-16 always uses 2 bytes for every character, their order starts to matter. Encoders can pick between the two options; one is called Little Endian, the other Big Endian. Normally, encoders then include a Byte Order Mark at the very start, so that the decoder knows which of the two order options to use when decoding.
Your posted data doesn't appear to include the BOM (I don't see the 0xFF
and 0xFE
bytes, but your data does look like it is using little-endian ordering. That fits with this being Windows; Windows always uses little-endian ordering for it's UTF-16 output.
If your data does have the BOM present, you can just decode as 'utf-16'
. If the BOM is missing, use 'utf-16-le'
:
>>> sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> sample.decode('utf-16-le')
'analyzin'
>>> import codecs
>>> (codecs.BOM_UTF16_LE + sample)
b'\xff\xfea\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> (codecs.BOM_UTF16_LE + sample).decode('utf-16')
'analyzin'
Answered By - Martijn Pieters
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.