Issue
I have a source data table with about 30 columns each for attribute name and attribute value (the columns are paired and numbered, as in the table below). I need to combine these into just two columns. I can do this in Power Query (M code) using the List.Zip function, something like the code below.
SOURCE DATA:
Title | Description | Attribute 1 name | Attribute 1 value | Attribute 2 name | Attribute 2 value |
---|---|---|---|---|---|
Title 1 | Desc 1 | Sport | NFL | Sport | NBA |
Title 2 | Desc 2 | Size | Large, Medium | Sleeve Type | Long Sleeve, Short Sleeve |
M Code in Power Query:
#"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each List.Zip(
{
{"Attribute 1 name", "Attribute 2 name"},
{[Attribute 1 name],[Attribute 2 name]},
{[#"Attribute 1 value(s)"],[#"Attribute 2 value(s)"]}
})),
#"Expanded Custom" = Table.ExpandListColumn(#"Added Custom", "Custom"),
#"Extracted Values" = Table.TransformColumns(#"Expanded Custom", {"Custom", each Text.Combine(List.Transform(_, Text.From), "|"), type text}),
#"Split Column by Delimiter" = Table.SplitColumn(#"Extracted Values", "Custom", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), {"Custom.1", "Custom.2", "Custom.3"}),
#"Removed Columns" = Table.RemoveColumns(#"Split Column by Delimiter",{"Name", "Custom.1"}),
#"Split Column by Delimiter1" = Table.SplitColumn(#"Removed Columns", "Attribute Value", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), {"Attribute Value.1", "Attribute Value.2"}),
#"Unpivoted Columns" = Table.UnpivotOtherColumns(#"Split Column by Delimiter1", {"Attribute Name"}, "Attribute", "Value")
in
#"Unpivoted Columns"
Result in Power Query:
Attribute name | Attribute value |
---|---|
Sport | NFL |
Size | Large |
Size | Medium |
Sport | NBA |
Sleeve Type | Long Sleeve |
Sleeve Type | Short Sleeve |
Notice that there are two primary sets of columns that need to be combined, each set containing about 30 columns. The sets are paired (Attribute 1 name & Attribute 1 value, all the way up to Attribute 30 name & Attribute 30 value), so each name needs to stay matched with its value on the same row.
Also, the "value" columns may or may not contain comma-separated values, which need to be split out into separate rows as well.
Finally, there are other columns to contend with that are not really needed for the final result and could drop out.
I have been trying to use either pandas.wide_to_long or a combination of pd.Index and pd.MultiIndex.from_tuples, but cannot quite get the correct statements to get this to work. Any help would be greatly appreciated!
Solution
You could extract the relevant information in the column names, then reshape with a MultiIndex, split and explode:
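import pandas as pd

# df is the source table, assumed to be loaded already.
# Parse each column name into (attribute number, "name"/"value").
idx = pd.MultiIndex.from_frame(
    df.columns
      .str.extract(r'\S+ (\d+) (\S+)')
)

out = (df.set_axis(idx, axis=1).stack(0)  # move the attribute number into the index
         .rename_axis(columns=None)
         .add_prefix('Attribute ')
         .sort_index(level=-1)            # order rows by attribute number
         # split comma-separated values into lists, then one row per value
         .assign(**{'Attribute value': lambda d: d['Attribute value'].str.split(', *')})
         .explode('Attribute value', ignore_index=True)
)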
Output:
  Attribute name Attribute value
0          Sport             NFL
1           Size           Large
2           Size          Medium
3          Sport             NBA
4    Sleeve Type     Long Sleeve
5    Sleeve Type    Short Sleeve
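For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same extract / stack / split / explode idea on the sample data from the question. It makes a few assumptions that are not in the answer above: the extra columns (Title, Description) are dropped first with DataFrame.filter, the regex is tightened to the exact "Attribute <n> name|value" pattern, and the attribute number is cast to int. The names attrs, parts and the level name "n" are purely illustrative.

```python
import pandas as pd

# Sample data copied from the question's source table
df = pd.DataFrame({
    "Title": ["Title 1", "Title 2"],
    "Description": ["Desc 1", "Desc 2"],
    "Attribute 1 name": ["Sport", "Size"],
    "Attribute 1 value": ["NFL", "Large, Medium"],
    "Attribute 2 name": ["Sport", "Sleeve Type"],
    "Attribute 2 value": ["NBA", "Long Sleeve, Short Sleeve"],
})

# Assumption: only the paired attribute columns matter, so drop the rest
attrs = df.filter(regex=r"^Attribute \d+ (name|value)$")

# Split each column name into (attribute number, "name"/"value")
parts = attrs.columns.str.extract(r"^Attribute (\d+) (name|value)$")
attrs = attrs.set_axis(
    pd.MultiIndex.from_arrays(
        [parts[0].astype(int), parts[1]],  # int so that 10 sorts after 2
        names=["n", None],
    ),
    axis=1,
)

out = (
    attrs.stack(0)                 # attribute number moves into the row index
         .add_prefix("Attribute ")  # columns become "Attribute name" / "Attribute value"
         .sort_index(level="n")     # order by attribute number, then source row
         .assign(**{"Attribute value": lambda d: d["Attribute value"].str.split(r",\s*")})
         .explode("Attribute value", ignore_index=True)  # one row per value
)

print(out)
```

Casting the number to int is a design choice for the real 30-pair table: left as strings, the level would sort lexicographically ("10" before "2"). Recent pandas versions may emit a FutureWarning about the stack implementation here.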
Answered By - mozway