Issue
I am trying to do some data cleaning and am using the pandas 'itertuples' method to generate named tuples for storage in a data frame. However, when I use itertuples, the column named 'class' is stored as '_1' in the named tuple, whereas all the other column names convert correctly. For instance, the 'subclass' column correctly converts to 'subclass' in the named tuple.
Code and output for one row is as follows:
ipcs.rename(columns={'ipc_section': 'section',
                     'ipc_class': 'class',
                     'ipc_subclass': 'subclass',
                     'ipc_main_group': 'group',
                     'ipc_subgroup': 'subgroup',
                     'ipc_sequence': 'order'}, inplace=True)
[item for item in
 ipcs[['section', 'class', 'subclass', 'group', 'subgroup', 'order']]
 .itertuples(index=False, name='IPC')]
Out[45]:
[IPC(section='A', _1='61', subclass='F', group='9', subgroup='00', order='0')]
What is going on here? I assume it's something to do with 'class' being a keyword in Python. Any way to get around this?
Solution
Found the answer in the documentation for namedtuples and itertuples.
From the namedtuple documentation, the full signature of the function is:
collections.namedtuple(typename, field_names, *, rename=False, defaults=None, module=None)
And the documentation states: "If rename is true, invalid fieldnames are automatically replaced with positional names. For example, ['abc', 'def', 'ghi', 'abc'] is converted to ['abc', '_1', 'ghi', '_3'], eliminating the keyword def and the duplicate fieldname abc."
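The quoted behaviour can be reproduced without pandas. The following is a minimal sketch using only the standard library; the three field names are a shortened, illustrative version of the columns in the question.

```python
from collections import namedtuple

# Shortened, illustrative version of the column list from the question.
fields = ['section', 'class', 'subclass']

# rename=True: the keyword 'class' is replaced by its positional name '_1'.
IPC = namedtuple('IPC', fields, rename=True)
print(IPC._fields)

# rename=False (the default): the same field list is rejected outright.
try:
    namedtuple('IPC', fields)
    rejected = False
except ValueError as exc:
    rejected = True
    print('ValueError:', exc)
```

The positional name is '_1' because 'class' sits at index 1 in the field list, which matches the '_1' seen in the question's output.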
In the pandas source code for the itertuples method we see the following:
if name is not None and len(self.columns) + index < 256:
    itertuple = collections.namedtuple(name, fields, rename=True)
    return map(itertuple._make, zip(*arrays))
Therefore, if we specify a name for the tuple (making it a named tuple rather than a plain tuple), this branch runs. pandas passes rename=True to namedtuple, so 'class', which is an invalid field name because it is a Python keyword, is automatically replaced with the positional name '_1'.
Note that this differs slightly from @chepner's comment on the question. Specifically, it IS possible to use 'class' as a column name (renaming 'ipc_class' to 'class' does work), BUT itertuples sets the rename parameter to True, so when the column names are passed to namedtuple the field name changes to a positional one. If rename were set to False, namedtuple would raise a ValueError instead.
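As for getting around it, here are a few options, sketched under assumptions rather than taken from the pandas documentation as recommended practice. The one-row frame is made up to mirror the question's output, and 'class_' is just an illustrative replacement name; any valid identifier that is not a keyword would do.

```python
import pandas as pd

# Hypothetical one-row frame mirroring the output shown in the question.
ipcs = pd.DataFrame({'section': ['A'], 'class': ['61'], 'subclass': ['F']})

# Option 1: keep the column name and read the renamed field by position.
row = next(ipcs.itertuples(index=False, name='IPC'))
print(row._fields)
print(row[1])

# Option 2: rename the column to a valid identifier before iterating.
# 'class_' is an illustrative choice, not something pandas requires.
safe = ipcs.rename(columns={'class': 'class_'})
row2 = next(safe.itertuples(index=False, name='IPC'))
print(row2.class_)

# Option 3: pass name=None to get plain tuples and skip field names entirely.
plain = next(ipcs.itertuples(index=False, name=None))
print(plain)
```

Option 2 is probably the most readable if the tuples are used by attribute name later on, since the data frame itself can keep the 'class' column and only the copy used for iteration is renamed.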
Answered By - bradchattergoon