Tuesday, January 9, 2024

[FIXED] How can I use custom groups and groupby.agg to merge rows in a df by concatenating certain columns and summing the rest based on another df?

January 09, 2024 group-by, pandas, python No comments

Issue

This is a bit of a follow-up to an earlier question I had asked a little over a year ago

I have a "master" dataframe that has the product codes and names of a lot of materials along with their monthly consumption. something like:

product code	Name	Consumption A	Consumption B	Consumption C	Consumption D
123	AA	100	120	130	140
456	BB	5	7	9	11
789	CC	12	5	33	89
134	AD	4	17	37	57
467	BD	1	3	5	7
179	ED	6	19	30	61
426	FD	8	5	2	13

And I also have a second table that has a list of some product codes that are "alternatives" or substitutes, like:

Product Code	Alt Code
123	134
123	179
123	426
456	467

(The difference from the previous one is that now, the same "Product Code" can have multiple different "Alt Code"s)

How can I can use this second dataframe to process the first one such that it becomes:

product code	Name	Consumption A	Consumption B	Consumption C	Consumption D
123 / 134 / 179 / 426	AA / AD / ED / FD	118	161	199	271
456 / 467	BB / BD	6	10	14	18
789	CC	12	5	33	89

where the product codes and names have been concatenated into the same cell, the quantities summed, and the "duplicates" of the alternatives deleted?

Unlike last time, there are now places where a single "main" code has multiple alternatives, but they will always be listed with the main code in the first column, and the alternative in the "alt code" column. I would ideally like to merge all alternative rows into a single one using the same bit of code.

This is the code I have been trying to use, based on the answer I got last time:

    if alt_name != "":
        altf = pd.read_excel(io=alt_name)
        group = df['Material'].map(lambda x: altf.set_index('Material')['Alt Material'].get(x, x))
        d = {c: 'sum' for c in df.columns}
        out = (df
               .astype({'Material': str})
               .groupby([group], as_index=False)
               .agg({**d, **{'Material': ' / '.join, 'Description': ' / '.join}})
               )
        df = out

alt_name is the name of the excel sheet with the columns, saved as "Material" and "Alt Material".

when I try running it with a table with multiple alternatives, I get the following error:

Traceback (most recent call last):
  File "/Users/[Location]/function_code_v2.py", line 285, in <module>
    trendcharts(file_name, zone, combinezones, month, numMonths, percent_wanted, trendgen, alt_name, save_directory)
  File "/Users/[Location]/function_code_v2.py", line 106, in trendcharts
    .agg({**d, **{'Material': ' / '.join, 'Description': ' / '.join}})
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/generic.py", line 1445, in aggregate
    result = op.agg()
             ^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/apply.py", line 175, in agg
    return self.agg_dict_like()
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/apply.py", line 406, in agg_dict_like
    return self.agg_or_apply_dict_like(op_name="agg")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/apply.py", line 1388, in agg_or_apply_dict_like
    result_index, result_data = self.compute_dict_like(
                                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/apply.py", line 480, in compute_dict_like
    getattr(obj._gotitem(key, ndim=1), op_name)(how, **kwargs)
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/generic.py", line 275, in aggregate
    if self.ngroups == 0:
       ^^^^^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/groupby.py", line 825, in ngroups
    return self.grouper.ngroups
           ^^^^^^^^^^^^^^^^^^^^
  File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/ops.py", line 758, in ngroups
    return len(self.result_index)
               ^^^^^^^^^^^^^^^^^
  File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/ops.py", line 769, in result_index
    return self.groupings[0].result_index.rename(self.names[0])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/grouper.py", line 718, in result_index
    return self.group_index
           ^^^^^^^^^^^^^^^^
  File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/grouper.py", line 722, in group_index
    codes, uniques = self._codes_and_uniques
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/groupby/grouper.py", line 801, in _codes_and_uniques
    codes, uniques = algorithms.factorize(  # type: ignore[assignment]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/algorithms.py", line 795, in factorize
    codes, uniques = factorize_array(
                     ^^^^^^^^^^^^^^^^
  File "/Users/[Location]/.venv/lib/python3.12/site-packages/pandas/core/algorithms.py", line 595, in factorize_array
    uniques, codes = table.factorize(
                     ^^^^^^^^^^^^^^^^
  File "pandas/_libs/hashtable_class_helper.pxi", line 7280, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 7194, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'Series'

Solution

You can use the same logic, aggregate using the common id:

d = {c: 'sum' for c in df.columns[1:]}
d['Name'] = lambda x: '/'.join(map(str, x))

out = (df
   .groupby(df['product code'].replace(altdf.set_index('Alt Code')['Product Code']))
   .agg(d).reset_index()
)

Then fix the product codes:

mapper = altdf.groupby('Product Code')['Alt Code'].agg(set)
out['product code'] = out['product code'].map(lambda x: '/'.join(map(str, sorted(mapper.get(x, set())|{x}))))

Output:

     product code         Name  Consumption A  Consumption B  Consumption C  Consumption D
0  123/134/179/426  AA/AD/ED/FD            118            161            199            271
1          456/467        BB/BD              6             10             14             18
2              789           CC             12              5             33             89

Answered By - mozway

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, January 9, 2024

[FIXED] How can I use custom groups and groupby.agg to merge rows in a df by concatenating certain columns and summing the rest based on another df?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels