Wednesday, January 10, 2024

[FIXED] How do I read a binary file into a Pandas DataFrame using Numpy dtypes?

Labels: binary, dataframe, numpy, pandas, python

Issue

I want to remove rows from a DataFrame that I generated by using a numpy.dtype template to read in a binary file. I've tried several methods of dropping rows and continue to be stymied by errors, typically:

TypeError: void() takes at least 1 positional argument (0 given)

Opening the variable explorer in an IDE shows the same error when trying to inspect the column name, which suggests an incorrect method for ingesting the data is somehow corrupting the column names.

I load the data in the following manner (number of variables shortened here for brevity):

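```
import numpy as np
import pandas as pd

# Record layout: 22 opaque header bytes, then two big-endian unsigned integers
data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# source_file is the path to the binary file
with open(source_file, 'rb') as f:
    byte_data = f.read()

np_data = np.frombuffer(byte_data, dtype=data_template)
df = pd.DataFrame(np_data)
```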
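For anyone who wants to reproduce the layout without the original file, here is a minimal sketch that packs a few made-up records with `struct` and reads them back through the same template. The record values and the `HDR-n` header contents are hypothetical; only the field names and dtypes come from the question.

```python
import struct

import numpy as np

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# Hypothetical records: (header bytes, variable_A, gpssec)
records = [(b'HDR-0', 1, 500), (b'HDR-1', 2, 1500), (b'HDR-2', 3, 2500)]

# '>22sHI' = big-endian, 22-byte string (NUL padded), uint16, uint32
byte_data = b''.join(struct.pack('>22sHI', *rec) for rec in records)

# Each packed record must be exactly one dtype item long
assert len(byte_data) == len(records) * data_template.itemsize

np_data = np.frombuffer(byte_data, dtype=data_template)
print(np_data['gpssec'])
```

This stays in NumPy on purpose, so the ingest step can be checked separately from anything pandas does with the result.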

When I try to filter the DataFrame:

`df = df[df['gpssec'] > 1000]`

I get:

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\frame.py:3798 in __getitem__
      return self._getitem_bool_array(key)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\frame.py:3853 in _getitem_bool_array
      return self._take_with_is_copy(indexer, axis=0)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\generic.py:3902 in _take_with_is_copy
      result = self._take(indices=indices, axis=axis)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\generic.py:3886 in _take
      new_data = self._mgr.take(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:978 in take
      return self.reindex_indexer(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:751 in  reindex_indexer
      new_blocks = [

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:752 in <listcomp>
      blk.take_nd(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\blocks.py:880 in take_nd
      new_values = algos.take_nd(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:117 in take_nd
      return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:134 in _take_nd_ndarray
      dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(

    File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:582 in _take_preprocess_indexer_and_fill_value
      dtype, fill_value = arr.dtype, arr.dtype.type()

    TypeError: void() takes at least 1 positional argument (0 given)


I've been able to work around the problem by copying each column of relevant data into a blank DataFrame that doesn't have the corrupt headers, but it's a kludgy solution. I'm not sure this qualifies as a bug, since it's very likely user error, but I can't find anything obvious that I'm doing wrong.

Solution

In [230]: data_template = np.dtype([
     ...:     ('header_a','V22'),
     ...:     ('variable_A','>u2'),
     ...:     ('gpssec','>u4')
     ...:     ])

Making a dummy array from this dtype:

In [231]: arr = np.zeros(4, data_template)
In [232]: arr
Out[232]: 
array([(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
       (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
       (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
       (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0)],
      dtype=[('header_a', 'V22'), ('variable_A', '>u2'), ('gpssec', '>u4')])

We can make a dataframe from it:

In [233]: df = pd.DataFrame(arr)

In [234]: df.describe()
Out[234]: 
       variable_A  gpssec
count         4.0     4.0
mean          0.0     0.0
std           0.0     0.0
min           0.0     0.0
25%           0.0     0.0
50%           0.0     0.0
75%           0.0     0.0
max           0.0     0.0

But displaying it, or calling `info`, raises an error:

In [235]: df.info()
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I can test a column:

In [238]: df['gpssec']<100
Out[238]: 
0    True
1    True
2    True
3    True
Name: gpssec, dtype: bool

In [240]: df1=df[df['gpssec']<100]

But displaying `df1` again raises the error.

Display without the void column is OK:

In [244]: df1[['gpssec', 'variable_A']]
Out[244]: 
   gpssec  variable_A
0       0           0
1       0           0
2       0           0
3       0           0

Displaying the void column produces your error:

In [245]: df1[['gpssec', 'header_a']]
TypeError: void() takes at least 1 positional argument (0 given)

So pandas has problems with that void dtype.

In [258]: df1.dtypes
Out[258]: 
header_a      |V22
variable_A     >u2
gpssec         >u4
dtype: object
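Since the numeric columns behave and only the void column causes trouble, one possible workaround is to leave the void fields out when building the DataFrame. This is a sketch under that assumption, not a pandas-recommended recipe; the sample `gpssec` values are made up, and the byte-order conversion is an extra precaution I've added because the fields are big-endian.

```python
import numpy as np
import pandas as pd

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

arr = np.zeros(4, data_template)
arr['gpssec'] = [10, 2000, 30, 4000]  # made-up values for the demo

# Keep every field whose dtype is not void (kind 'V')
keep = [name for name in arr.dtype.names if arr.dtype[name].kind != 'V']

# Build the frame column by column, converting to native byte order on the way
df = pd.DataFrame({
    name: arr[name].astype(arr.dtype[name].newbyteorder('='))
    for name in keep
})

filtered = df[df['gpssec'] > 1000]
print(filtered)
```

This is essentially the questioner's column-copying workaround, just driven by the dtype so that it does not have to list each column by hand.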

I suspect a 22-byte bytestring could hold the same data without these void problems, but I haven't worked a lot with void/bytestring dtypes:

In [259]: data_template = np.dtype([
     ...:     ('header_a','S22'),
     ...:     ('variable_A','>u2'),
     ...:     ('gpssec','>u4')
     ...:     ])
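One thing to keep in mind if you try `S22`: NumPy strips trailing NUL bytes when it hands back an element of an `S` field, whereas a `V` field returns all 22 bytes. The sketch below reads the same made-up record with both templates to show the difference; the header content `HDR` is hypothetical.

```python
import numpy as np

void_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])
bytes_template = np.dtype([
    ('header_a', 'S22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# One made-up record: 3 header bytes padded with NULs to 22, then the two integers
header = b'HDR' + b'\x00' * 19
raw = header + (7).to_bytes(2, 'big') + (1234).to_bytes(4, 'big')

as_void = np.frombuffer(raw, dtype=void_template)
as_bytes = np.frombuffer(raw, dtype=bytes_template)

void_header = as_void['header_a'][0].tobytes()  # all 22 bytes, padding included
bytes_header = as_bytes['header_a'][0]          # trailing NULs are dropped

print(len(void_header), len(bytes_header))
```

If the header is opaque binary data that can legitimately end in zero bytes, that stripping may matter; if it is NUL-padded text, it is probably what you want.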

The `void` error:

In [273]: np.void()
TypeError: void() takes at least 1 positional argument (0 given)


In [274]: np.void(22)
Out[274]: void(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.void
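The last frame of the traceback calls `arr.dtype.type()` with no arguments to get a fill value. A small sketch of why that works for the integer columns but not for the void one (the variable names here are mine):

```python
import numpy as np

void_type = np.dtype('V22').type   # numpy.void
uint_type = np.dtype('>u4').type   # numpy.uint32

# A zero-argument call is fine for a numeric scalar type
default_uint = uint_type()

# The same call on numpy.void raises TypeError
try:
    void_type()
    void_error = None
except TypeError as exc:
    void_error = str(exc)

# With an integer argument, numpy.void returns a zero-filled value of that size
sized_void = void_type(22)

print(default_uint, void_error, len(sized_void.tobytes()))
```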

This void section talks about a void representing a structured array. If I construct a nested structured array, I get the same sort of errors:

In [275]: data_template = np.dtype([
     ...:     ('header_a',[('x',int, 3)]),
     ...:     ('variable_A','>u2'),
     ...:     ('gpssec','>u4')
     ...:     ])

In [276]: arr = np.zeros(4, data_template)

In [277]: arr
Out[277]: 
array([(([0, 0, 0],), 0, 0), (([0, 0, 0],), 0, 0), (([0, 0, 0],), 0, 0),
       (([0, 0, 0],), 0, 0)],
      dtype=[('header_a', [('x', '<i4', (3,))]), ('variable_A', '>u2'), ('gpssec', '>u4')])

In [278]: df = pd.DataFrame(arr)

In [279]: df
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [284]: df.dtypes
Out[284]: 
header_a      [('x', '<i4', (3,))]
variable_A                     >u2
gpssec                         >u4
dtype: object

So pandas has problems with a column that itself is a structured array.
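If the header bytes are needed in the DataFrame, one option that avoids a void column is to store each header as a plain Python `bytes` object in an object-dtype column. This is a sketch under that assumption; the records are made up, and I have not checked how it scales to large files.

```python
import struct

import numpy as np
import pandas as pd

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# Hypothetical records: (header bytes, variable_A, gpssec)
records = [(b'HDR-0', 1, 500), (b'HDR-1', 2, 1500), (b'HDR-2', 3, 2500)]
byte_data = b''.join(struct.pack('>22sHI', *rec) for rec in records)
np_data = np.frombuffer(byte_data, dtype=data_template)

columns = {}
for name in np_data.dtype.names:
    field = np_data[name]
    if field.dtype.kind == 'V':
        # Void (or nested structured) field: keep the raw bytes as Python objects
        columns[name] = pd.Series([item.tobytes() for item in field], dtype=object)
    else:
        # Numeric field: convert to native byte order
        columns[name] = field.astype(field.dtype.newbyteorder('='))

df = pd.DataFrame(columns)
filtered = df[df['gpssec'] > 1000]
print(filtered[['variable_A', 'gpssec']])
```

A nested structured field also has kind `'V'`, so it ends up here as raw bytes too; unpacking its subfields into separate columns would need extra code.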

Answered By - hpaulj

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
