Issue
I want to remove rows from a DataFrame that I generated by using a `numpy.dtype` template to read in a binary file. I've tried multiple methods of dropping rows and keep getting stymied by errors, typically:
TypeError: void() takes at least 1 positional argument (0 given)
Opening the variable explorer in an IDE shows the same error when trying to inspect the column name, which suggests an incorrect method for ingesting the data is somehow corrupting the column names.
I load the data in the following manner (number of variables shortened here for brevity):
```
data_template = np.dtype([
('header_a','V22'),
('variable_A','>u2'),
('gpssec','>u4')
])
with open(source_file, 'rb') as f: byte_data = f.read()
np_data = np.frombuffer(byte_data, data_template)
df = pd.DataFrame(np_data)
```
When I try to reduce the DataFrame with a boolean mask:
`df = df[df['gpssec'] > 1000]`
I get:
```
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\frame.py:3798 in __getitem__
return self._getitem_bool_array(key)
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\frame.py:3853 in _getitem_bool_array
return self._take_with_is_copy(indexer, axis=0)
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\generic.py:3902 in _take_with_is_copy
result = self._take(indices=indices, axis=axis)
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\generic.py:3886 in _take
new_data = self._mgr.take(
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:978 in take
return self.reindex_indexer(
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:751 in reindex_indexer
new_blocks = [
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\managers.py:752 in <listcomp>
blk.take_nd(
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\internals\blocks.py:880 in take_nd
new_values = algos.take_nd(
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:117 in take_nd
return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:134 in _take_nd_ndarray
dtype, fill_value, mask_info = _take_preprocess_indexer_and_fill_value(
File C:\ProgramData\anaconda311\Lib\site-packages\pandas\core\array_algos\take.py:582 in _take_preprocess_indexer_and_fill_value
dtype, fill_value = arr.dtype, arr.dtype.type()
TypeError: void() takes at least 1 positional argument (0 given)
```
I've been able to work around the problem by copying each column of relevant data into a blank DataFrame that doesn't have the corrupt headers, but it's a kludgy solution. Not sure this qualifies as a bug as it's very likely it's a user error, but I can't find anything obvious I'm doing wrong.
Solution
In [230]: data_template = np.dtype([
...: ('header_a','V22'),
...: ('variable_A','>u2'),
...: ('gpssec','>u4')
...: ])
Making a dummy array from this dtype:
In [231]: arr = np.zeros(4, data_template)
In [232]: arr
Out[232]:
array([(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0),
(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 0, 0)],
dtype=[('header_a', 'V22'), ('variable_A', '>u2'), ('gpssec', '>u4')])
We can make a dataframe from it:
In [233]: df = pd.DataFrame(arr)
In [234]: df.describe()
Out[234]:
variable_A gpssec
count 4.0 4.0
mean 0.0 0.0
std 0.0 0.0
min 0.0 0.0
25% 0.0 0.0
50% 0.0 0.0
75% 0.0 0.0
max 0.0 0.0
But displaying the frame, or calling `info()`, raises an error:
In [235]: df.info()
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I can test a column:
In [238]: df['gpssec']<100
Out[238]:
0 True
1 True
2 True
3 True
Name: gpssec, dtype: bool
In [240]: df1=df[df['gpssec']<100]
But displaying `df1` again raises the error. Displaying it without the `void` column is fine:
In [244]: df1[['gpssec', 'variable_A']]
Out[244]:
gpssec variable_A
0 0 0
1 0 0
2 0 0
3 0 0
Displaying a selection that includes the `void` column produces your error:
In [245]: df1[['gpssec', 'header_a']]
TypeError: void() takes at least 1 positional argument (0 given)
So pandas has problems with that `void` dtype.
In [258]: df1.dtypes
Out[258]:
header_a |V22
variable_A >u2
gpssec >u4
dtype: object
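One way to sidestep the problem is to keep the `void` field out of the DataFrame altogether. The sketch below assumes the header bytes are not needed for the analysis; `np.zeros` with made-up `gpssec` values stands in for the array that `np.frombuffer` would return, and `numeric_fields` is just an illustrative name.

```python
import numpy as np
import pandas as pd

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# Stand-in for np.frombuffer(byte_data, data_template); the values are made up.
np_data = np.zeros(4, dtype=data_template)
np_data['gpssec'] = [500, 1500, 2500, 900]

# Keep every field whose dtype is not void (dtype kind 'V').
numeric_fields = [name for name in np_data.dtype.names
                  if np_data.dtype[name].kind != 'V']

# Build the frame one column at a time, converting each big-endian field
# to native byte order on the way in.
df = pd.DataFrame({
    name: np_data[name].astype(np_data.dtype[name].newbyteorder('='))
    for name in numeric_fields
})

filtered = df[df['gpssec'] > 1000]
print(filtered)
```

This is essentially the column-by-column workaround from the question, written as a single dict comprehension so it does not grow with the number of fields.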
I suspect a 22-byte bytestring (`S22`) could hold the same data without these `void` problems, but I haven't worked a lot with void/bytestrings:
In [259]: data_template = np.dtype([
...: ('header_a','S22'),
...: ('variable_A','>u2'),
...: ('gpssec','>u4')
...: ])
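One caveat worth checking before switching to `S22`: NumPy treats trailing NUL bytes in an `S` field as padding and drops them when an element is read, whereas a `V22` field always yields all 22 bytes. The NumPy-only sketch below illustrates that with a single synthetic record packed by `struct`; the header content `b'ABC'` is invented, and how pandas then handles the `S22` column is not tested here.

```python
import struct
import numpy as np

# One synthetic 28-byte record: 22-byte header, uint16, uint32, all big-endian.
# struct pads the 22s field with NUL bytes after b'ABC'.
raw = struct.pack('>22sHI', b'ABC', 7, 1001)

void_template = np.dtype([('header_a', 'V22'), ('variable_A', '>u2'), ('gpssec', '>u4')])
bytes_template = np.dtype([('header_a', 'S22'), ('variable_A', '>u2'), ('gpssec', '>u4')])

as_void = np.frombuffer(raw, dtype=void_template)
as_bytes = np.frombuffer(raw, dtype=bytes_template)

void_header = as_void['header_a'][0].tobytes()   # raw bytes of the void scalar
bytes_header = as_bytes['header_a'][0]           # np.bytes_ with trailing NULs dropped

print(len(void_header), void_header)
print(len(bytes_header), bytes_header)
print(as_bytes['variable_A'][0], as_bytes['gpssec'][0])
```

If the header is opaque binary that can legitimately end in zero bytes, `S22` may silently shorten it, so it is safest for text-like headers.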
The `void` error itself:
In [273]: np.void()
TypeError: void() takes at least 1 positional argument (0 given)
In [274]: np.void(22)
Out[274]: void(b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.void
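To connect this back to the traceback in the question: its last frame calls `arr.dtype.type()` with no arguments to get a fill value. The sketch below tries the same zero-argument call for three dtypes; the `results` dict is only there for illustration, and other pandas versions may reach this code path differently.

```python
import numpy as np

results = {}
for code in ('>u4', 'S22', 'V22'):
    scalar_type = np.dtype(code).type      # e.g. np.uint32, np.bytes_, np.void
    try:
        # The zero-argument constructor call made in the traceback's last frame.
        results[code] = ('ok', scalar_type())
    except TypeError as exc:
        results[code] = ('TypeError', str(exc))

for code, (status, detail) in results.items():
    print(code, np.dtype(code).type.__name__, status, repr(detail))
```

Only the `void` scalar type lacks a zero-argument constructor, which is why the numeric columns filter fine on their own.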
The `void` section of the NumPy docs linked above also describes `void` as the scalar type behind structured arrays. If I construct a nested structured array, I get the same sort of errors:
In [275]: data_template = np.dtype([
...: ('header_a',[('x',int, 3)]),
...: ('variable_A','>u2'),
...: ('gpssec','>u4')
...: ])
In [276]: arr = np.zeros(4, data_template)
In [277]: arr
Out[277]:
array([(([0, 0, 0],), 0, 0), (([0, 0, 0],), 0, 0), (([0, 0, 0],), 0, 0),
(([0, 0, 0],), 0, 0)],
dtype=[('header_a', [('x', '<i4', (3,))]), ('variable_A', '>u2'), ('gpssec', '>u4')])
In [278]: df = pd.DataFrame(arr)
In [279]: df
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
In [284]: df.dtypes
Out[284]:
header_a [('x', '<i4', (3,))]
variable_A >u2
gpssec >u4
dtype: object
So pandas has problems with a column that itself is a structured array.
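If the header bytes need to travel with the rows, one option is to convert any `void` field to plain Python `bytes` objects before pandas sees it. The helper below is a hypothetical sketch, not a pandas or NumPy API: `structured_to_frame` is an invented name, the two records are synthetic data packed with `struct`, and a nested structured field would simply end up as its raw bytes.

```python
import struct
import numpy as np
import pandas as pd

def structured_to_frame(records):
    """Hypothetical helper: one DataFrame column per field, no void dtypes."""
    columns = {}
    for name in records.dtype.names:
        field = records[name]
        if field.dtype.kind == 'V':
            # Void (or nested structured) field: store each item's raw bytes
            # as a Python object so the column gets dtype object.
            columns[name] = [item.tobytes() for item in field]
        else:
            # Numeric field: convert to native byte order.
            columns[name] = field.astype(field.dtype.newbyteorder('='))
    return pd.DataFrame(columns)

data_template = np.dtype([
    ('header_a', 'V22'),
    ('variable_A', '>u2'),
    ('gpssec', '>u4'),
])

# Two synthetic 28-byte records standing in for the binary file contents.
byte_data = (struct.pack('>22sHI', b'HDR1', 7, 500)
             + struct.pack('>22sHI', b'HDR2', 8, 1500))

np_data = np.frombuffer(byte_data, dtype=data_template)
df = structured_to_frame(np_data)

filtered = df[df['gpssec'] > 1000]
print(filtered)
```

The trade-off is that the header column becomes an object column of 22-byte values, which is slower than a native dtype but keeps every byte, including trailing zeros.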
Answered By - hpaulj