Issue
I have a pandas data frame like this
EDITED
Promotion ID | Month | Products |
---|---|---|
PID-1 | June | Refer below for sample1 |
PID-2 | July | Refer below for sample2 |
sample1: | Product Id| |--| |PROD1| |PROD2|
sample2: | Product Id| |--| |PROD1| |PROD2| |PROD3|
I would like to transform this data frame into the following
Promotion ID | Month | Products |
---|---|---|
PID-1 | June | PROD1 |
PROD2 | ||
PID-2 | July | PROD 1 |
PROD2 | ||
PROD3 |
The empty spaces can just be None
or NA
values. Is there a way to do this in pandas without looping through the rows?
Solution
You could use explode
to flatten your dataframe like this:
#generating data
df = pd.DataFrame([
['pid-1', 'June', '| Product Id| |PROD1| |PROD2|'],
['pid-2', 'July', '| Product Id| |PROD1| |PROD2| |PROD3|']
], columns = ['Promotion ID', 'Month', 'Products'])
# extracting the product list
df['Products'] = df['Products']\
.apply(lambda s: [x for x in re.split(' *\| *', s) if x != '' and x != 'Product Id'])
exploded_df = exploded_df = df.explode('Products', ignore_index=True)
At this point df
and exploded_df
look like this:
# df
Promotion ID Month Products
0 pid-1 June [PROD1, PROD2]
1 pid-2 July [PROD1, PROD2, PROD3]
# exploded_df
Promotion ID Month Products
0 pid-1 June PROD1
1 pid-1 June PROD2
2 pid-2 July PROD1
3 pid-2 July PROD2
4 pid-2 July PROD3
I would stop there. IMHO only keeping the values of Month
and Promotion ID
for the first rows will only make your like harder. Yet, since you ask you could use rank
and loc
to assign None
to all the rows that are not the first of their groups:
# rank needs a numeric column
exploded_df['index'] = exploded_df.index
# using rank to create a filter on rows that are not the first of their group
filter = exploded_df\
.groupby(['Promotion ID'])['index']\
.rank('dense').apply(lambda x: x > 1)
# getting rid of the index column
exploded_df = exploded_df.drop('index', axis=1)
# and voila
exploded_df.loc[filter, ['Month', 'Promotion ID']] = None
The result:
Promotion ID Month Products
0 None None PROD1
1 pid-1 June PROD2
2 None None PROD1
3 pid-2 July PROD2
4 pid-2 July PROD3
Answered By - Oli
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.