Monday, October 31, 2022

[FIXED] subprocess stdout string decoding not working

ascii, python-3.x, string, subprocess, utf-8

Issue

I'm using the following subprocess call to run a command-line tool. The tool doesn't print its output in one go; it writes to the command line progressively, over multiple lines and over a period of time. The tool is bs1770gain, and the command would be "path\to\bs1770gain.exe" "-i" "\path\to\audiofile.wav". With the --loglevel parameter you can include more data, but you cannot remove the progressive results being written to stdout.

I need stdout to return a human readable string (hence the stdout_formatted operation):

with subprocess.Popen(list_of_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
    stdout, stderr = proc.communicate()
    stdout_formatted = stdout.decode('UTF-8')
    stderr_formatted = stderr.decode('UTF-8')

However, I can only view the variable as a human-readable string if I print it, e.g.:

In [23]: print(stdout_formatted )
      nalyzing ...   [1/2] "filename.wav": 
          integrated:  -2.73 LUFS / -20.27 LU   [2/2] 
      "filename2.wav":         
          integrated:  -4.47 LUFS / -18.53 LU   
      [ALBUM]:
          integrated:  -3.52 LUFS / -19.48 LU done.

In [24]: stdout_formatted 
Out[24]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\.......

In [6]: stdout
Out[6]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\......

In [4]: type(stdout)
Out[4]: bytes

In [5]: type(stdout_formatted)
Out[5]: str

If you look carefully, the human-readable characters are in the string (the first word is "analyzing").

I guessed that the stdout value needs decoding/encoding, so I tried different ways:

stdout_formatted.encode("ascii")
Out[18]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g

stdout_formatted.encode("utf-8")
Out[17]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

stdout.decode("utf-8")
Out[15]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

stdout.decode("ascii")
Out[14]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

bytes(stdout).decode("ascii")
Out[13]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

I used a library called chardet to check the encoding of stdout:

import chardet

chardet.detect(stdout)
Out[26]: {'confidence': 1.0, 'encoding': 'ascii', 'language': ''}

I'm working on Windows 10 and am using Python 3.6 (the Anaconda package and its integrated Spyder IDE).

I'm kind of clutching at straws now - is it possible to capture in a variable what is displayed in the console when print is called, or to remove the unwanted NUL bytes from the stdout string?

Solution

You don't have UTF-8 data. You have UTF-16 data. UTF-16 uses at least two bytes for every character; characters in the ASCII and Latin-1 ranges (such as a) still use 2 bytes, but one of those bytes is always a \x00 NUL byte.
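As a small illustration (the sample string is chosen for the example and is not the tool's real output), encoding ASCII text as UTF-16 shows where the NUL bytes come from, and why decoding those bytes as ASCII or UTF-8 raises no error yet leaves the NULs in the string:

```python
text = "analyzing"

# Each ASCII character becomes two bytes in little-endian UTF-16:
# the character's own byte followed by a NUL byte.
raw = text.encode("utf-16-le")
print(raw)       # b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\x00'
print(len(raw))  # 18, two bytes per character

# NUL (0x00) is itself a valid ASCII and UTF-8 character, so decoding with
# the wrong codec raises no error; it just keeps the NULs in the result.
wrong = raw.decode("utf-8")
print(repr(wrong))

# Decoding with the matching codec recovers the original text.
right = raw.decode("utf-16-le")
print(right)     # analyzing
```

This is also consistent with chardet reporting 'ascii' in the question: every byte in such data is below 0x80, so it is valid ASCII even though ASCII is not the intended encoding.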

Because UTF-16 works in 2-byte units, the order of the bytes in each unit matters. Encoders can pick between the two options; one is called Little Endian, the other Big Endian. Normally, encoders then include a Byte Order Mark (BOM) at the very start, so that the decoder knows which of the two order options to use when decoding.
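A short sketch of the two byte orders and the Byte Order Mark, using only the standard codecs module (the character 'a' is just a sample):

```python
import codecs

# The same character in the two possible byte orders.
little = "a".encode("utf-16-le")
big = "a".encode("utf-16-be")
print(little)  # b'a\x00'
print(big)     # b'\x00a'

# The BOM is the code point U+FEFF written in the chosen byte order.
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'

# The generic 'utf-16' decoder reads the BOM, picks the byte order from it,
# and drops the BOM from the decoded result.
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
print(from_little)  # a
print(from_big)     # a
```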

Your posted data doesn't appear to include the BOM (I don't see the 0xFF and 0xFE bytes), but your data does look like it is using little-endian ordering. That fits with this being Windows; Windows always uses little-endian ordering for its UTF-16 output.

If your data does have the BOM present, you can just decode as 'utf-16'. If the BOM is missing, use 'utf-16-le':

>>> sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> sample.decode('utf-16-le')
'analyzin'
>>> import codecs
>>> (codecs.BOM_UTF16_LE + sample)
b'\xff\xfea\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> (codecs.BOM_UTF16_LE + sample).decode('utf-16')
'analyzin'
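
Applied to the subprocess call from the question, a minimal sketch might look like the following. The helper name decode_utf16_output is hypothetical, and because bs1770gain is not assumed to be installed here, a small child Python process stands in for it and writes UTF-16-LE bytes without a BOM to its stdout:

```python
import codecs
import subprocess
import sys


def decode_utf16_output(raw: bytes) -> str:
    """Decode UTF-16 bytes, assuming little-endian when no BOM is present."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic codec reads the BOM and removes it from the result.
        return raw.decode("utf-16")
    return raw.decode("utf-16-le")


# Stand-in for the real tool (an assumption for this example): a child
# process that emits UTF-16-LE bytes with no BOM, like the question's data.
child_code = "import sys; sys.stdout.buffer.write('analyzing ...'.encode('utf-16-le'))"
list_of_args = [sys.executable, "-c", child_code]

with subprocess.Popen(list_of_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:
    stdout, stderr = proc.communicate()

stdout_formatted = decode_utf16_output(stdout)
print(repr(stdout_formatted))  # 'analyzing ...'
```

Whether the tool's stderr uses the same encoding is not stated in the question, so it is left undecoded in this sketch.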

Answered By - Martijn Pieters

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
