Saturday, October 30, 2021

[FIXED] Identify customer segments based on transactions that they have made in specific period using Python

Labels: pandas, python, python-3.x, python-datetime

Issue

For customer segmentation, I want to count how many transactions each customer made in the prior 10 days and the prior 20 days, based on a given table of transaction records with dates. In the table (the output of the code below), the last 2 columns were added using the following code.

However, I'm not satisfied with this code. Please suggest improvements.

import pandas as pd

df4 = pd.read_excel(path)

# Since A and B two customers are there, two separate dataframe created

df4A = df4[df4['Customer_ID'] == 'A']
df4B = df4[df4['Customer_ID'] == 'B']

from datetime import date
from dateutil.relativedelta import relativedelta

txn_prior_10days = []

for i in range(len(df4)):
    
    current_date = df4.iloc[i,2]
    prior_10days_date = current_date - relativedelta(days=10)
    
    if df4.iloc[i,1] == 'A':
        No_of_txn = ((df4A['Transaction_Date'] >= prior_10days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)
    
    elif df4.iloc[i,1] == 'B':
        No_of_txn = ((df4B['Transaction_Date'] >= prior_10days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_10days.append(No_of_txn)

txn_prior_20days = []

for i in range(len(df4)):
    
    current_date = df4.iloc[i,2]
    prior_20days_date = current_date - relativedelta(days=20)
    
    if df4.iloc[i,1] == 'A':
        no_of_txn = ((df4A['Transaction_Date'] >= prior_20days_date) & (df4A['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn)
    
    elif df4.iloc[i,1] == 'B':
        no_of_txn = ((df4B['Transaction_Date'] >= prior_20days_date) & (df4B['Transaction_Date'] < current_date)).sum()
        txn_prior_20days.append(no_of_txn) 

df4['txn_prior_10days'] = txn_prior_10days
df4['txn_prior_20days'] = txn_prior_20days
df4

Solution

Your code would be very difficult to write if you had, e.g., 10 different Customer_IDs. Fortunately, there is a much shorter solution:

When you read your file, convert Transaction_Date to datetime, e.g. by passing parse_dates=['Transaction_Date'] to read_excel.
Define a function counting how many dates in the group (gr) fall in the range from dd - tDtl (tDtl being a Timedelta) to 1 day before the current date (dd), both ends inclusive:
```
def cntPrevTr(dd, gr, tDtl):
    return gr.between(dd - tDtl, dd - pd.Timedelta(1, 'D')).sum()
```
It will be applied twice to each member of the current group by Customer_ID (actually to the Transaction_Date column only): once with tDtl == 10 days and a second time with tDtl == 20 days.
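As a quick check of the range used above, the sketch below compares the inclusive range [dd - tDtl, dd - 1 day] with the half-open range [dd - tDtl, dd) from the question. The variable names are illustrative only, and the two counts are expected to agree only when the dates carry no time-of-day component.

```python
import pandas as pd

# Dates of customer A, taken from the sample data in this post.
dates = pd.Series(pd.to_datetime(
    ['2019-01-01', '2019-01-03', '2019-01-09', '2019-01-12', '2019-01-19']))

dd = pd.Timestamp('2019-01-12')   # the "current" transaction date
window = pd.Timedelta(10, 'D')

# Inclusive range, as in cntPrevTr: [dd - 10 days, dd - 1 day].
inclusive_count = dates.between(dd - window, dd - pd.Timedelta(1, 'D')).sum()

# Half-open range, as in the question's code: [dd - 10 days, dd).
half_open_count = ((dates >= dd - window) & (dates < dd)).sum()

print(inclusive_count, half_open_count)
```

If the column held timestamps with hours and minutes, a transaction earlier on the same day would be counted by the half-open form but not by the inclusive one.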

Define a function computing both columns with the number of previous transactions for the current group of transaction dates:

def priorTx(td):
    return pd.DataFrame({
        'tx10' : td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D'))),
        'tx20' : td.apply(cntPrevTr, args=(td, pd.Timedelta(20, 'D')))})

Generate the result:
```
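# group_keys=False keeps the original row index (needed in pandas >= 2.0,
# where apply would otherwise prepend Customer_ID to the index).
df[['txn_prior_10days', 'txn_prior_20days']] = df.groupby('Customer_ID', group_keys=False)\
    .Transaction_Date.apply(priorTx)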
```
The code above:
- groups df by Customer_ID,
- takes only the Transaction_Date column from each group,
- applies priorTx function to it,
- saves the result in 2 target columns.
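For convenience, the steps above can be put together into one runnable sketch. The DataFrame is built inline from the sample data shown in this post rather than read from Excel, and group_keys=False is an assumption added here so that the result keeps the original row index in recent pandas versions.

```python
import pandas as pd

# Sample data from this post, built inline instead of read from a file.
df = pd.DataFrame({
    'Transaction_ID': [912410, 912341, 312415, 432513, 357912,
                       912411, 912342, 312416, 432514, 357913],
    'Customer_ID': list('AAAAABBBBB'),
    'Transaction_Date': pd.to_datetime([
        '2019-01-01', '2019-01-03', '2019-01-09', '2019-01-12', '2019-01-19',
        '2019-01-06', '2019-01-11', '2019-01-13', '2019-01-20', '2019-01-21']),
})

def cntPrevTr(dd, gr, tDtl):
    # Count dates in gr that fall within [dd - tDtl, dd - 1 day].
    return gr.between(dd - tDtl, dd - pd.Timedelta(1, 'D')).sum()

def priorTx(td):
    # Build both count columns for one customer's transaction dates.
    return pd.DataFrame({
        'tx10': td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D'))),
        'tx20': td.apply(cntPrevTr, args=(td, pd.Timedelta(20, 'D')))})

# group_keys=False: keep the original index so the assignment aligns by row.
result = df.groupby('Customer_ID', group_keys=False).Transaction_Date.apply(priorTx)
df[['txn_prior_10days', 'txn_prior_20days']] = result
print(df)
```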

The result, with Transaction_ID values shortened a bit, is:

   Transaction_ID Customer_ID Transaction_Date  txn_prior_10days  txn_prior_20days
0          912410           A       2019-01-01                 0                 0   
1          912341           A       2019-01-03                 1                 1   
2          312415           A       2019-01-09                 2                 2   
3          432513           A       2019-01-12                 2                 3   
4          357912           A       2019-01-19                 2                 4   
5          912411           B       2019-01-06                 0                 0   
6          912342           B       2019-01-11                 1                 1   
7          312416           B       2019-01-13                 2                 2   
8          432514           B       2019-01-20                 2                 3   
9          357913           B       2019-01-21                 3                 4

I did not use a rolling computation here, for two reasons:

a time-based rolling window (e.g. rolling('10D')) requires the dates to be sorted within each customer, which the solution above does not,
rolling calculations include the current row by default, whereas you want to exclude it (passing closed='left' is needed to leave it out).

This is why I came up with the above solution (just 8 lines of code).
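For comparison, here is a hedged sketch of a time-based rolling window. It assumes a reasonably recent pandas, dates sorted within each customer, and no repeated timestamps per customer (behaviour with duplicates should be checked separately). The helper column name one is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': list('AAAAABBBBB'),
    'Transaction_Date': pd.to_datetime([
        '2019-01-01', '2019-01-03', '2019-01-09', '2019-01-12', '2019-01-19',
        '2019-01-06', '2019-01-11', '2019-01-13', '2019-01-20', '2019-01-21']),
})

# Time-based windows need the dates sorted within each customer.
df = df.sort_values(['Customer_ID', 'Transaction_Date']).reset_index(drop=True)
df['one'] = 1   # helper column (hypothetical name): summing it counts rows

rolled = (
    df.set_index('Transaction_Date')
      .groupby('Customer_ID')['one']
      .rolling('10D', closed='left')   # window [t - 10 days, t), current row excluded
      .sum()
)

# An empty window yields NaN, which is treated as zero prior transactions.
# The rows come back grouped by customer in date order, matching the sorted df.
df['txn_prior_10days'] = rolled.fillna(0).astype(int).to_numpy()
print(df[['Customer_ID', 'Transaction_Date', 'txn_prior_10days']])
```

On this sample the counts agree with the groupby/apply solution, but the positional assignment relies on the sort step, so it is less forgiving than the index-aligned approach.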

Details of how my solution works

To see all details, create the test DataFrame the following way:

import io
import pandas as pd

txt = '''
Transaction_ID Customer_ID Transaction_Date
912410         A           2019-01-01
912341         A           2019-01-03
312415         A           2019-01-09
432513         A           2019-01-12
357912         A           2019-01-19
912411         B           2019-01-06
912342         B           2019-01-11
312416         B           2019-01-13
432514         B           2019-01-20
357913         B           2019-01-21'''

df = pd.read_fwf(io.StringIO(txt), skiprows=1,
    widths=[15, 12, 16], parse_dates=[2])
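If fixed-width parsing is inconvenient, the same test data can probably be built directly. This is only an alternative sketch: it constructs the DataFrame from Python lists and converts the date strings with pd.to_datetime, then checks that the column really has a datetime dtype, which the date arithmetic in cntPrevTr depends on.

```python
import pandas as pd

df = pd.DataFrame({
    'Transaction_ID': [912410, 912341, 312415, 432513, 357912,
                       912411, 912342, 312416, 432514, 357913],
    'Customer_ID': list('AAAAABBBBB'),
    'Transaction_Date': [
        '2019-01-01', '2019-01-03', '2019-01-09', '2019-01-12', '2019-01-19',
        '2019-01-06', '2019-01-11', '2019-01-13', '2019-01-20', '2019-01-21'],
})

# Strings must become datetimes before Timedelta arithmetic can be applied.
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])

is_datetime = pd.api.types.is_datetime64_any_dtype(df['Transaction_Date'])
print(df.dtypes)
print(is_datetime)
```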

Perform groupby, but for now retrieve only the group with key 'A':

gr = df.groupby('Customer_ID')
grp = gr.get_group('A')

It contains:

   Transaction_ID Customer_ID Transaction_Date
0          912410           A       2019-01-01
1          912341           A       2019-01-03
2          312415           A       2019-01-09
3          432513           A       2019-01-12
4          357912           A       2019-01-19

Let's start with the most detailed part: how cntPrevTr works. Retrieve one of the dates from grp:

dd = grp.iloc[2,2]

It contains Timestamp('2019-01-09 00:00:00'). To test an example invocation of cntPrevTr for this date, run:

cntPrevTr(dd, grp.Transaction_Date, pd.Timedelta(10, 'D'))

i.e. you check how many transactions this customer performed before this date, but no earlier than 10 days back. The result is 2.
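The arithmetic behind that result can be spelled out by hand. The sketch below uses only the standard library (no pandas) and illustrative variable names to show the window bounds for 2019-01-09 and which of customer A's dates fall inside them.

```python
from datetime import date, timedelta

# Customer A's transaction dates from the sample data.
dates_a = [date(2019, 1, 1), date(2019, 1, 3), date(2019, 1, 9),
           date(2019, 1, 12), date(2019, 1, 19)]

dd = date(2019, 1, 9)               # the current transaction date
lower = dd - timedelta(days=10)     # 2018-12-30, start of the window
upper = dd - timedelta(days=1)      # 2019-01-08, the day before dd

# Both bounds are inclusive, mirroring Series.between in cntPrevTr.
in_window = [d for d in dates_a if lower <= d <= upper]
print(lower, upper, in_window, len(in_window))
```

Only 2019-01-01 and 2019-01-03 fall in the window, hence the count of 2.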

To see how the whole first column is computed, run:

td = grp.Transaction_Date
td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D')))

The result is:

0    0
1    1
2    2
3    2
4    2
Name: Transaction_Date, dtype: int64

The left column is the index and the right column holds the values returned by the cntPrevTr call for each date.
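To make the role of args clearer: Series.apply passes each element as the first argument and appends the values from args after it. The sketch below, with illustrative variable names, compares the apply call with an explicit loop that does the same thing.

```python
import pandas as pd

def cntPrevTr(dd, gr, tDtl):
    # Count dates in gr that fall within [dd - tDtl, dd - 1 day].
    return gr.between(dd - tDtl, dd - pd.Timedelta(1, 'D')).sum()

# Customer A's dates, named like the column in the post.
td = pd.Series(pd.to_datetime(
    ['2019-01-01', '2019-01-03', '2019-01-09', '2019-01-12', '2019-01-19']),
    name='Transaction_Date')
ten_days = pd.Timedelta(10, 'D')

# apply calls cntPrevTr(element, td, ten_days) for every element of td.
via_apply = td.apply(cntPrevTr, args=(td, ten_days))

# The same calls written out as an explicit loop over the dates.
via_loop = pd.Series([cntPrevTr(dd, td, ten_days) for dd in td],
                     index=td.index, name=td.name)

print(via_apply.tolist(), via_loop.tolist())
```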

The last thing to show is how the result for the whole group is generated. Run:

priorTx(grp.Transaction_Date)

The result (a DataFrame) is:

   tx10  tx20
0     0     0
1     1     1
2     2     2
3     2     3
4     2     4

The same procedure takes place for all other groups; then all partial results are concatenated (vertically), and the last step is to save both columns of the resulting DataFrame in the respective columns of df.
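The concatenation step can be imitated by hand. The following sketch is not what pandas does internally, only an approximation of the described procedure: it runs priorTx on each group separately, stacks the partial results with pd.concat, and joins them back to df on the row index.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': list('AAAAABBBBB'),
    'Transaction_Date': pd.to_datetime([
        '2019-01-01', '2019-01-03', '2019-01-09', '2019-01-12', '2019-01-19',
        '2019-01-06', '2019-01-11', '2019-01-13', '2019-01-20', '2019-01-21']),
})

def cntPrevTr(dd, gr, tDtl):
    return gr.between(dd - tDtl, dd - pd.Timedelta(1, 'D')).sum()

def priorTx(td):
    return pd.DataFrame({
        'tx10': td.apply(cntPrevTr, args=(td, pd.Timedelta(10, 'D'))),
        'tx20': td.apply(cntPrevTr, args=(td, pd.Timedelta(20, 'D')))})

# One partial result per customer; each keeps its original row labels.
parts = [priorTx(grp.Transaction_Date) for _, grp in df.groupby('Customer_ID')]

# Stack the partial results vertically and rename to the target columns.
combined = pd.concat(parts)
combined.columns = ['txn_prior_10days', 'txn_prior_20days']

# Join on the row index, so each count lands next to its own transaction.
out = df.join(combined)
print(out)
```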

Answered By - Valdi_Bo

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
