Issue
I want to generate synthetic data from scratch: sequence data with a binary outcome (0/1). My data has the following properties.
For the sake of an example, let's say there are only 3 items in the sequence, namely A, B and C. So the data is:
- It is sequence-based data, so items A, B, C happen in an order.
- Items A, B, C have features S, T, U, V, X, Y, Z, etc. (these features need to have some effect on generating outcome 1; think of them as feature importance).
- The probability of conversion when A, B or C is encountered in the data is user-defined (I want control so that if A occurs in any part of the sequence, the overall probability of conversion to outcome 1 is, say, 2%; more below).
- Items can repeat in a sequence, so a sequence can be like C->C->A, etc.
Given the probability of conversion for each item when it occurs in the data (for example, whenever A is encountered in the sequence, the probability of outcome 1 is about 2%, when B occurs it is 2.6%, and so on), I want to generate data randomly. The generated data should look something like this:
ID Sequence Feature Outcome
1 A->B X 0
2 C->C->B Y 1
3 A->B X 1
4 A Z 0
5 A->B->A Z 0
6 C->C Y 1
and so on
When generating this data, I want to have control over:
- The conversion probability of A, B and C, essentially defining that when A occurs the probability of conversion is, let's say, 2%, for B it is 4% and for C it is 3.6%.
- The number of converted sequences for each sequence length (for example, the maximum sequence length is 3, so for sequences of length 3 I want at least 100000 data points having outcome 1).
- How many items I can include (so A, B, C and D, giving a maximum sequence length of 4 instead of 3).
- The total number of data points, if possible.
Is there a simple way to generate this data while keeping all these parameters in mind?
Solution
import pandas as pd
import itertools
import numpy as np
import random

alphabets=['A','B','C']
# Build every ordered sequence (with repetition) of length 1..len(alphabets).
combinations=[]
for length in range(1,len(alphabets)+1):
    combinations.append(['->'.join(seq) for seq in itertools.product(alphabets, repeat = length)])
combinations=(sum(combinations, []))
# Random sampling weight per sequence, normalised to sum to 1.
weights=np.random.normal(100,30,len(combinations))
weights/=sum(weights)
weights=weights.tolist()
#weights=np.random.dirichlet(np.ones(len(combinations))*1000.,size=1)
'''n = len(combinations)
weights = [random.random() for _ in range(n)]
sum_weights = sum(weights)
weights = [w/sum_weights for w in weights]'''
# Draw 1,000,000 sequences according to the weights.
df=pd.DataFrame(random.choices(
    population=combinations,weights=weights,
    k=1000000),columns=['sequence'])
# Inspect the sampling weights and the resulting sequence counts.
import matplotlib.pyplot as plt
plt.hist(weights, bins = 20)
plt.show()
distribution=df.groupby('sequence').agg({'sequence':'count'}).rename(columns={'sequence':'Total_Numbers'}).reset_index()
plt.hist(distribution.Total_Numbers)
plt.show()
from tqdm import tqdm

# Base conversion probability per item (user-defined).
A=0.2
B=0.8
C=0.1
# Count rows that are exactly X->X->X, that contain X->X, and that contain X at all.
count_AAA=count_AA=count_A=0
count_BBB=count_BB=count_B=0
count_CCC=count_CC=count_C=0
for i in tqdm(range(0,len(df))):
    if(df.sequence[i]=='A->A->A'):
        count_AAA+=1
    if('A->A' in df.sequence[i]):
        count_AA+=1
    if('A' in df.sequence[i]):
        count_A+=1
    if(df.sequence[i]=='B->B->B'):
        count_BBB+=1
    if('B->B' in df.sequence[i]):
        count_BB+=1
    if('B' in df.sequence[i]):
        count_B+=1
    if(df.sequence[i]=='C->C->C'):
        count_CCC+=1
    if('C->C' in df.sequence[i]):
        count_CC+=1
    if('C' in df.sequence[i]):
        count_C+=1
# One Bernoulli draw per counted row; each pattern uses a scaled share of the base probability.
bi_AAA = np.random.binomial(1, A*0.9, count_AAA)
bi_AA = np.random.binomial(1, A*0.5, count_AA)
bi_A = np.random.binomial(1, A*0.1, count_A)
bi_BBB = np.random.binomial(1, B*0.9, count_BBB)
bi_BB = np.random.binomial(1, B*0.5, count_BB)
bi_B = np.random.binomial(1, B*0.1, count_B)
bi_CCC = np.random.binomial(1, C*0.9, count_CCC)
bi_CC = np.random.binomial(1, C*0.5, count_CC)
bi_C = np.random.binomial(1, C*0.15, count_C)
# Empirical conversion rate among the B->B->B draws.
bi_BBB.sum()/count_BBB

# Running positions into each array of draws (named i_* so A, B, C keep their probabilities).
i_AAA=i_AA=i_A=i_BBB=i_BB=i_B=i_CCC=i_CC=i_C=0
for i in tqdm(range(0,len(df))):
    if(df.sequence[i]=='A->A->A'):
        df.at[i, 'Outcome_AAA'] = bi_AAA[i_AAA]
        i_AAA+=1
    if('A->A' in df.sequence[i]):
        df.at[i, 'Outcome_AA'] = bi_AA[i_AA]
        i_AA+=1
    if('A' in df.sequence[i]):
        df.at[i, 'Outcome_A'] = bi_A[i_A]
        i_A+=1
    if(df.sequence[i]=='B->B->B'):
        df.at[i, 'Outcome_BBB'] = bi_BBB[i_BBB]
        i_BBB+=1
    if('B->B' in df.sequence[i]):
        df.at[i, 'Outcome_BB'] = bi_BB[i_BB]
        i_BB+=1
    if('B' in df.sequence[i]):
        df.at[i, 'Outcome_B'] = bi_B[i_B]
        i_B+=1
    if(df.sequence[i]=='C->C->C'):
        df.at[i, 'Outcome_CCC'] = bi_CCC[i_CCC]
        i_CCC+=1
    if('C->C' in df.sequence[i]):
        df.at[i, 'Outcome_CC'] = bi_CC[i_CC]
        i_CC+=1
    if('C' in df.sequence[i]):
        df.at[i, 'Outcome_C'] = bi_C[i_C]
        i_C+=1
# Rows that did not match a pattern have no draw for it; treat those as 0.
df=df.fillna(0)
# A row converts if any of its pattern draws came up 1.
df['Outcome']=df.apply(lambda x: 1 if x.Outcome_AAA+x.Outcome_BBB+x.Outcome_CCC+\
                       x.Outcome_AA+x.Outcome_BB+x.Outcome_CC+\
                       x.Outcome_A+x.Outcome_B+x.Outcome_C>0 else 0,axis=1)
dataset=df[['sequence','Outcome']]
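The row-by-row loops above can be slow on a million rows. As a rough sketch only, and not the answer's own method, the same idea (a row converts if any independent per-item draw succeeds) can be written as a single vectorized draw. This version makes several simplifying assumptions that differ from the code above: sequences are sampled uniformly rather than with random weights, each distinct item present contributes its full probability once (no 0.1/0.5/0.9 scaling for runs), and the function name `generate_sequences` and its parameters are hypothetical.

```python
import itertools

import numpy as np
import pandas as pd


def generate_sequences(items, item_prob, n_rows, max_len=None, seed=0):
    """Sample sequences uniformly and draw one Bernoulli outcome per row.

    Assumption: each distinct item in a sequence converts independently,
    so P(outcome = 1) = 1 - prod(1 - p_item) over the items present.
    """
    rng = np.random.default_rng(seed)
    max_len = max_len or len(items)

    # Every ordered sequence with repetition, for lengths 1..max_len.
    population = [
        combo
        for length in range(1, max_len + 1)
        for combo in itertools.product(items, repeat=length)
    ]

    # Conversion probability of each candidate sequence.
    p_population = np.array([
        1.0 - np.prod([1.0 - item_prob[item] for item in set(combo)])
        for combo in population
    ])

    # Pick a sequence index per row, then draw all outcomes in one call.
    idx = rng.integers(0, len(population), size=n_rows)
    outcome = rng.binomial(1, p_population[idx])

    return pd.DataFrame({
        "sequence": ["->".join(population[i]) for i in idx],
        "Outcome": outcome,
    })


if __name__ == "__main__":
    data = generate_sequences(
        items=["A", "B", "C"],
        item_prob={"A": 0.02, "B": 0.04, "C": 0.036},
        n_rows=100_000,
    )
    # Empirical conversion rate per sequence, for a quick sanity check.
    print(data.groupby("sequence")["Outcome"].mean().head())
```

Adding a fourth item is then a matter of passing `items=["A", "B", "C", "D"]` with a matching entry in `item_prob`, and `n_rows` sets the total number of data points. Guaranteeing a minimum number of converted rows per sequence length is not handled here; one option would be to keep sampling until the count is reached.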
Answered By - Kshitij Yadav