Issue
this is my first post and I will try my best to describe the problem.
I have a table below to help illustrate. I am trying to combine some transactions for each Account ID. The logic I want is: Account AAAAA had their 1st transaction "Close Date" on 2019-12-16, if their following transactions are "closed" within 1 month after the 1st (i.e. window 2019-12-16 ~ 2020-01-16), then I want to combine those transactions to be just one single transaction, but use the "Created Date" from the 1st transaction (i.e 2019-12-15) and "Close Date" from the last transaction within the window (i.e 2020-01-02). Then after that window, the next transaction for Account AAAAA is "closed" on 2020-01-22, the window will be 2020-01-22 ~ 2020-02-22, but they only have one transaction in that window, so it just stays as a single transaction. I want to implement this logic for each Account ID.
Initial Table:
Account ID | Created Date | Close Date | Product Code |
---|---|---|---|
AAAAAA | 2019-12-15 | 2019-12-16 | Apple; Banana |
AAAAAA | 2019-12-20 | 2019-12-26 | Strawberry |
AAAAAA | 2020-01-02 | 2020-01-02 | Apple; Banana |
AAAAAA | 2020-01-15 | 2020-01-22 | Apple; Banana |
BBBBBB | 2020-02-04 | 2020-02-05 | Watermelon |
BBBBBB | 2020-03-09 | 2020-03-10 | Strawberry |
BBBBBB | 2020-03-10 | 2020-03-15 | Apple; Banana |
import pandas as pd
data = [['AAAAA', '2019-12-15', '2019-12-16', 'Apple; Banana'],
['AAAAA', '2019-12-20', '2019-12-26', 'Strawberry'],
['AAAAA', '2020-01-02', '2020-01-02', 'Apple; Banana'],
['AAAAA', '2020-01-15', '2020-01-22', 'Apple; Banana'],
['BBBBB', '2020-02-04', '2020-02-05', 'Watermelon'],
['BBBBB', '2020-03-09', '2020-03-10', 'Strawberry'],
['BBBBB', '2020-03-10', '2020-03-15', 'Apple; Banana']]
initial_df = pd.DataFrame(data, columns=['Account ID', 'Created Date', 'Close Date', 'Products'])
initial_df
After:
Account ID | Created Date | Close Date | Product Code |
---|---|---|---|
AAAAAA | 2019-12-15 | 2020-01-02 | Apple; Banana; Strawberry; Apple; Banana |
AAAAAA | 2020-01-15 | 2020-01-22 | Apple; Banana |
BBBBB | 2020-02-04 | 2020-02-05 | Watermelon |
BBBBB | 2020-03-09 | 2020-03-15 | Strawberry; Apple; Banana |
Is there a way I can achieve this? I will greatly appreciate you help!!
Solution
I was able to figure it out myself using below codes, but if there is a better way, I would love to know as well.
from dateutil.relativedelta import relativedelta
grouped = initial_df.groupby(by=['Account ID'])
window_list = []
for a, b in grouped:
window_end_date = list(b['Close Date'])[0] + dateutil.relativedelta.relativedelta(months=1)
for i in b['Close Date']:
if pd.to_datetime(i) <= window_end_date:
window_list.append(window_end_date)
elif pd.to_datetime(i) > window_end_date:
window_end_date = i + dateutil.relativedelta.relativedelta(months=1)
window_list.append(window_end_date)
initial_df['window'] = window_list
created_start_list = []
close_end_list = []
grouped = initial_df.groupby(['Account ID', 'window'])
for a, b in grouped:
created_start = min(list(b['Created Date']))
created_start_list.append(created_start)
close_end = max(list(b['Close Date']))
close_end_list.append(close_end)
df = grouped['Products'].apply('; '.join).reset_index()
df['created_start'] = created_start_list
df['closed_start'] = close_end_list
df
Answered By - minxyen
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.