Saturday, November 25, 2023

[FIXED] How to combine sets of columns in a dataframe where one column may contain lists

Labels: numpy, pandas, python

Issue

I have a source data table with about 30 columns each for attribute name and attribute value (they are paired and numbered as in the table below). I need to combine these into just two columns. I can do this in Power Query (M code) using the List.Zip function, with something like the code below.

SOURCE DATA:

Title	Description	Attribute 1 name	Attribute 1 value	Attribute 2 name	Attribute 2 value
Title 1	Desc 1	Sport	NFL	Sport	NBA
Title 2	Desc 2	Size	Large, Medium	Sleeve Type	Long Sleeve, Short Sleeve
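For anyone who wants to experiment in pandas, the sample table above could be rebuilt roughly as follows. This is only a sketch: the variable name `df` is an assumption (chosen to match the name used in the solution further down), and the real data has about 30 attribute pairs rather than two.

```python
import pandas as pd

# Sample rows copied from the SOURCE DATA table; `df` is an assumed name.
df = pd.DataFrame({
    "Title": ["Title 1", "Title 2"],
    "Description": ["Desc 1", "Desc 2"],
    "Attribute 1 name": ["Sport", "Size"],
    "Attribute 1 value": ["NFL", "Large, Medium"],
    "Attribute 2 name": ["Sport", "Sleeve Type"],
    "Attribute 2 value": ["NBA", "Long Sleeve, Short Sleeve"],
})

print(df.shape)
```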

M Code in Power Query:

#"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each List.Zip(
{
     {"Attribute 1 name", "Attribute 2 name"}, 
     {[Attribute 1 name],[Attribute 2 name]}, 
     {[#"Attribute 1 value(s)"],[#"Attribute 2 value(s)"]}
})),
#"Expanded Custom" = Table.ExpandListColumn(#"Added Custom", "Custom"),
#"Extracted Values" = Table.TransformColumns(#"Expanded Custom", {"Custom", each Text.Combine(List.Transform(_, Text.From), "|"), type text}),
#"Split Column by Delimiter" = Table.SplitColumn(#"Extracted Values", "Custom", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), {"Custom.1", "Custom.2", "Custom.3"}),
#"Removed Columns" = Table.RemoveColumns(#"Split Column by Delimiter",{"Name", "Custom.1"}),
#"Split Column by Delimiter1" = Table.SplitColumn(#"Removed Columns", "Attribute Value", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), {"Attribute Value.1", "Attribute Value.2"}),
#"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Split Column by Delimiter1", {"Attribute Name"}, "Attribute", "Value")
in
#"Unpivoted Columns"

Result in Power Query:

Attribute name	Attribute value
Sport	NFL
Size	Large
Size	Medium
Sport	NBA
Sleeve Type	Long Sleeve
Sleeve Type	Short Sleeve

Notice that there are two primary sets of columns that need to be combined, each set containing about 30 columns. The sets are paired (Attribute 1 name & Attribute 1 value, all the way up to Attribute 30 name & Attribute 30 value), so each name needs to stay matched with its value on the same row.

Also, the "value" columns may or may not contain comma-separated values, which need to be split out into separate rows as well.
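As a small illustration of that splitting requirement, here is a minimal pandas sketch on a standalone Series. The names `values` and `exploded` are made up for the example, and it assumes every cell is a string.

```python
import pandas as pd

# Two cells from the sample data: one single value, one comma-separated list.
values = pd.Series(["NFL", "Large, Medium"], name="Attribute value")

exploded = (
    values.str.split(",")              # each cell becomes a list of pieces
          .explode(ignore_index=True)  # one row per piece
          .str.strip()                 # drop the space left after each comma
)

print(exploded.tolist())
```

Cells without a comma pass through unchanged, since splitting them yields a one-element list.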

Finally, there are other columns to contend with that are not really needed for the final result and could drop out.

I have been trying to use either pandas.wide_to_long or a combination of pd.Index and pd.MultiIndex.from_tuples, but cannot quite get the correct statements to make this work. Any help would be greatly appreciated!

Solution

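You could keep only the attribute columns, extract the relevant information from their names, then reshape with a MultiIndex, split and explode:

import pandas as pd

# Keep only the "Attribute N name" / "Attribute N value" columns
attrs = df.filter(regex=r'^Attribute \d+ \S+$')

# Split each column name into (number, kind), e.g. ('1', 'name')
idx = pd.MultiIndex.from_frame(
    attrs.columns
         .str.extract(r'\S+ (\d+) (\S+)')
)

out = (attrs.set_axis(idx, axis=1).stack(0)    # one row per (source row, attribute number)
         .rename_axis(columns=None)
         .add_prefix('Attribute ')
         .sort_index(level=-1)                  # order by attribute number
         .assign(**{'Attribute value': lambda d: d['Attribute value'].str.split(', *')})
         .explode('Attribute value', ignore_index=True)
      )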

Output:

  Attribute name Attribute value
0          Sport             NFL
1           Size           Large
2           Size          Medium
3          Sport             NBA
4    Sleeve Type     Long Sleeve
5    Sleeve Type    Short Sleeve
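For comparison, the same result can also be reached without a MultiIndex by concatenating each name/value pair in turn. This is an alternative sketch, not the answer's method. The helper name `combine_attribute_pairs` is made up, and it assumes that every "Attribute N name" column has a matching "Attribute N value" column, that values are strings, and that rows without an attribute name can be dropped.

```python
import re
import pandas as pd

def combine_attribute_pairs(df):
    """Stack 'Attribute N name' / 'Attribute N value' pairs into two columns."""
    # Find the attribute numbers in the column names and sort them numerically,
    # so that 10 comes after 2 (a plain string sort would put '10' first).
    pattern = re.compile(r"Attribute (\d+) name")
    matches = (pattern.fullmatch(col) for col in df.columns)
    numbers = sorted(int(m.group(1)) for m in matches if m)

    # One two-column frame per pair, all with the same column labels.
    pairs = [
        df[[f"Attribute {n} name", f"Attribute {n} value"]]
          .set_axis(["Attribute name", "Attribute value"], axis=1)
        for n in numbers
    ]

    return (
        pd.concat(pairs, ignore_index=True)
          .dropna(subset=["Attribute name"])  # assumption: unused pairs are empty
          .assign(**{"Attribute value": lambda d: d["Attribute value"].str.split(",")})
          .explode("Attribute value", ignore_index=True)
          .assign(**{"Attribute value": lambda d: d["Attribute value"].str.strip()})
    )

# Sample data from the question; Title and Description are simply ignored.
df = pd.DataFrame({
    "Title": ["Title 1", "Title 2"],
    "Description": ["Desc 1", "Desc 2"],
    "Attribute 1 name": ["Sport", "Size"],
    "Attribute 1 value": ["NFL", "Large, Medium"],
    "Attribute 2 name": ["Sport", "Sleeve Type"],
    "Attribute 2 value": ["NBA", "Long Sleeve, Short Sleeve"],
})

out = combine_attribute_pairs(df)
print(out)
```

On the sample data this should give the same six rows as the output above. Because only columns matching the attribute pattern are selected, other columns drop out without an explicit filter step.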

Answered By - mozway

This answer was collected from Stack Overflow and tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
