Tuesday, April 26, 2022

[FIXED] Generate binary outcome dummy data based on probability of items and its feature

Tags: dataframe, numpy, pandas, python, scikit-learn

Issue

I want to generate synthetic data from scratch: sequence data with a binary outcome (0/1). My data has the following properties:

For the sake of an example, let's say there are only 3 items in the sequence, namely A, B and C. So the data is:

It is sequence-based data, so items A, B and C happen in an order
Items A, B and C have features S, T, U, V, X, Y, Z, etc. (these features need to have some effect on generating outcome 1; think of them as feature importance)
The probability of conversion when A, B or C is encountered in the data is user-defined (I want control so that if A occurs in any part of the sequence, the overall probability of conversion to outcome 1 is, let's say, 2%; more below)
Items can repeat in a sequence, so a sequence can be like C->C->A, etc.

Given the probability of conversion for each item when it occurs in the data (for example, whenever A is encountered in the sequence the probability of outcome 1 is about 2%, when B occurs it is 2.6%, and so on), I want to generate data randomly. The generated data should look something like this:

ID  Sequence  Feature  Outcome
1   A->B      X        0
2   C->C->B   Y        1
3   A->B      X        1
4   A         Z        0
5   A->B->A   Z        0
6   C->C      Y        1

and so on

When generating this data, I want to have control over -

The conversion probability of A, B and C, i.e. when A occurs the probability of conversion is, let's say, 2%, for B it is 4% and for C it is 3.6%.
The number of converted sequences for each sequence length (for example, if the maximum sequence length is 3, I want at least 100000 data points of length 3 with outcome 1)
How many items I can include (so A, B, C and D, with a sequence length of 4 instead of 3)
The total number of data points, if possible

Is there a simple way to generate this data while keeping all these parameters in mind?

Solution

import itertools
import random

import numpy as np
import pandas as pd

# Items that can appear in a sequence
alphabets = ['A', 'B', 'C']

# Every ordered sequence of length 1..len(alphabets), repeats allowed (e.g. 'C->C->A')
combinations = []
for length in range(1, len(alphabets) + 1):
    combinations.extend('->'.join(p) for p in itertools.product(alphabets, repeat=length))

# Random sampling weight for each sequence, normalised to sum to 1
weights = np.random.normal(100, 30, len(combinations))
weights = np.clip(weights, 1e-9, None)  # a normal draw can occasionally be negative
weights /= weights.sum()
weights = weights.tolist()
# Alternative 1: weights = np.random.dirichlet(np.ones(len(combinations)) * 1000.).tolist()
# Alternative 2: draw random.random() for each sequence and divide by the sum

# Draw one million sequences according to the weights
df = pd.DataFrame(
    random.choices(population=combinations, weights=weights, k=1000000),
    columns=['sequence'],
)
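
The question also asks for control over the number of items and the total number of rows. A minimal sketch of how the sequence-generation step above could be wrapped in a function follows. The names `make_sequences`, `max_len`, `n_rows` and `seed` are hypothetical and not part of the answer, and Dirichlet weights are swapped in for the normal draw (an assumption) because they are always positive and already sum to 1.

```python
import itertools

import numpy as np
import pandas as pd


def make_sequences(items, max_len, n_rows, seed=0):
    """Sample n_rows sequences built from items, with lengths 1..max_len."""
    rng = np.random.default_rng(seed)
    # All ordered sequences with repeats, joined the same way as in the answer
    population = [
        "->".join(p)
        for length in range(1, max_len + 1)
        for p in itertools.product(items, repeat=length)
    ]
    # Assumption: Dirichlet weights instead of the answer's normal draw
    weights = rng.dirichlet(np.full(len(population), 10.0))
    return pd.DataFrame({"sequence": rng.choice(population, size=n_rows, p=weights)})


# Four items, sequences of up to four steps, 10,000 rows
sample = make_sequences(["A", "B", "C", "D"], max_len=4, n_rows=10_000)
print(sample["sequence"].value_counts().head())
```

Changing `items`, `max_len` or `n_rows` changes the item set, the longest sequence and the dataset size independently of each other.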

import matplotlib.pyplot as plt

# Distribution of the sampling weights
plt.hist(weights, bins=20)
plt.show()

# Number of generated rows per sequence
distribution = df.groupby('sequence').size().reset_index(name='Total_Numbers')
plt.hist(distribution.Total_Numbers)
plt.show()

from tqdm import tqdm  # progress bar for the row-by-row loop further down

# Base conversion weight per item; each rule below scales it by a factor
A = 0.2
B = 0.8
C = 0.1

# Rows matching each rule: exactly X->X->X, contains X->X, contains X
seq = df.sequence
count_AAA = int((seq == 'A->A->A').sum())
count_AA = int(seq.str.contains('A->A', regex=False).sum())
count_A = int(seq.str.contains('A', regex=False).sum())
count_BBB = int((seq == 'B->B->B').sum())
count_BB = int(seq.str.contains('B->B', regex=False).sum())
count_B = int(seq.str.contains('B', regex=False).sum())
count_CCC = int((seq == 'C->C->C').sum())
count_CC = int(seq.str.contains('C->C', regex=False).sum())
count_C = int(seq.str.contains('C', regex=False).sum())

# One Bernoulli draw per matching row, with probability = base weight * rule factor
bi_AAA = np.random.binomial(1, A * 0.9, count_AAA)
bi_AA = np.random.binomial(1, A * 0.5, count_AA)
bi_A = np.random.binomial(1, A * 0.1, count_A)

bi_BBB = np.random.binomial(1, B * 0.9, count_BBB)
bi_BB = np.random.binomial(1, B * 0.5, count_BB)
bi_B = np.random.binomial(1, B * 0.1, count_B)

bi_CCC = np.random.binomial(1, C * 0.9, count_CCC)
bi_CC = np.random.binomial(1, C * 0.5, count_CC)
bi_C = np.random.binomial(1, C * 0.15, count_C)
# Sanity check: empirical rate for 'B->B->B' rows, should be close to B * 0.9 = 0.72
print(bi_BBB.sum() / count_BBB)

# Attach each rule's draws to the rows it matches, in row order; all other rows get 0
seq = df.sequence
rules = [
    ('Outcome_AAA', seq == 'A->A->A', bi_AAA),
    ('Outcome_AA', seq.str.contains('A->A', regex=False), bi_AA),
    ('Outcome_A', seq.str.contains('A', regex=False), bi_A),
    ('Outcome_BBB', seq == 'B->B->B', bi_BBB),
    ('Outcome_BB', seq.str.contains('B->B', regex=False), bi_BB),
    ('Outcome_B', seq.str.contains('B', regex=False), bi_B),
    ('Outcome_CCC', seq == 'C->C->C', bi_CCC),
    ('Outcome_CC', seq.str.contains('C->C', regex=False), bi_CC),
    ('Outcome_C', seq.str.contains('C', regex=False), bi_C),
]
for column, mask, draws in rules:
    df[column] = 0
    df.loc[mask, column] = draws


# A row converts if any of its rule-level draws is 1
outcome_columns = [c for c in df.columns if c.startswith('Outcome_')]
df['Outcome'] = (df[outcome_columns].sum(axis=1) > 0).astype(int)
dataset = df[['sequence', 'Outcome']]
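
Because the final Outcome is the OR of independent Bernoulli draws, the conversion probability of a sequence has a closed form: one minus the product of (1 - p) over every rule the sequence matches. The sketch below, offered as an illustration rather than as the answerer's code, computes that probability per sequence and samples Outcome with a single draw, which has the same distribution as the rule-by-rule approach. The names `conversion_probability`, `contains_factor`, `prob` and `sim` are hypothetical, and the sequences here are drawn uniformly for brevity (an assumption).

```python
import itertools

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

items = ["A", "B", "C"]
base = {"A": 0.2, "B": 0.8, "C": 0.1}               # A, B, C from the answer
contains_factor = {"A": 0.1, "B": 0.1, "C": 0.15}   # factors for "item occurs anywhere"


def conversion_probability(sequence):
    """P(Outcome = 1): one minus the chance that every matching rule draws 0."""
    p_all_zero = 1.0
    for item in items:
        if item in sequence:
            p_all_zero *= 1 - base[item] * contains_factor[item]
        if f"{item}->{item}" in sequence:
            p_all_zero *= 1 - base[item] * 0.5
        if sequence == f"{item}->{item}->{item}":
            p_all_zero *= 1 - base[item] * 0.9
    return 1 - p_all_zero


sequences = [
    "->".join(p)
    for length in range(1, 4)
    for p in itertools.product(items, repeat=length)
]
prob = {s: conversion_probability(s) for s in sequences}

# Assumption: uniform sampling of sequences instead of the answer's random weights
sim = pd.DataFrame({"sequence": rng.choice(sequences, size=100_000)})
sim["Outcome"] = rng.binomial(1, sim["sequence"].map(prob).to_numpy())

# Empirical conversion rate among rows that contain each item
for item in items:
    has_item = sim["sequence"].str.contains(item, regex=False)
    print(item, round(sim.loc[has_item, "Outcome"].mean(), 3))
```

Note that the rate printed for rows containing A will generally sit above A * 0.1 = 0.02, since those rows can also convert through the B, C and repeat rules. Hitting an exact per-item target would require tuning the factors against these printed rates.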

Answered By - Kshitij Yadav

This answer, collected from Stack Overflow and tested by the PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
