Issue
When scipy.stats.spearmanr(df1,df2) is used on two data frames you can unstack the output values to get 2 numpy arrays of correlation and p-values. How do you put them back into a data frame with the same column names as the original data frames?
correlation, p_value = scipy.stats.spearmanr(df1, df2)
yields 2 numpy arrays but would like to get data frame with correlation and p_value next to each column for x columns as in:
not obvious...here is some code to get random data (suggested by ouroboros1)
import scipy.stats
import numpy as np
import pandas as pd
# Create array of 5 rows and 3 columns,
# filled with random values from 1 to 10
data = np.random.randint(1,10,size=(5,3))
# Create Dataframe with data
df1 = pd.DataFrame(data, columns=['Col_1','Col_2','Col_3'])
# Set target column in new data frame
df2 = df1['Col_3'].to_frame().copy()
df1.drop(['Col_3'], axis=1, inplace=True)
# Obtain correlation coefficient for each
# value in column against value in every other column
correlation, p_value = scipy.stats.spearmanr(df1, df2)
(edited) I think I found a way but there must be a shorter path:
# Concat frames that contain features with frame that
# contains target
correlation_frame = pd.concat([df1, df2], axis=1)
col_cor_list = list(correlation_frame)
# Create new frame containing all correlation coefficients
cor = pd.DataFrame(correlation, columns=col_list)
cor
Col_1 Col_2 Col_3
0 1.000000 -0.883883 -0.229416
1 -0.883883 1.000000 0.081111
2 -0.229416 0.081111 1.000000
# Get a dataframe for p-values
col_p_list = ['p_value' + str(x) for x in
range(1,len(p_value)+1)]
p_frame = pd.DataFrame(p_value, columns=col_p_list)
p_frame
p_value1 p_value2 p_value3
0 1.404265e-24 0.046662 0.710482
1 4.666188e-02 0.000000 0.896840
2 7.104817e-01 0.896840 0.000000
# Combine values by alternating column names
# so p-values are placed correctly
alter_names = (list(sum(zip(col_cor_list, col_p_list), ())))
final = cor.join(p_frame)
results = final[alter_names]
results
Col_1 p_value1 Col_2 p_value2 Col_3 p_value3
0 1.000000 1.404265e-24 -0.883883 0.046662 -0.229416 0.710482
1 -0.883883 4.666188e-02 1.000000 0.000000 0.081111 0.896840
2 -0.229416 7.104817e-01 0.081111 0.896840 1.000000 0.000000
Solution
Suggested refactored version:
Get correlation and p_value
import scipy.stats
import numpy as np
import pandas as pd
n = 3
data = np.random.randint(1,10,size=(5,n))
# create cols `Col_1`, `_2` etc.
cols = [f'Col_{i+1}' for i in range(n)]
df1 = pd.DataFrame(data, columns=cols)
# create `df2` based on index, and drop
df2 = df1.iloc[:,-1].to_frame().copy()
df1 = df1.iloc[:,:-1]
correlation, p_value = scipy.stats.spearmanr(df1, df2)
Turn into df
- Use
np.hstack
to stack correlation and p_value in sequence horizontally. - Use
sorted
to sort the columns. Key consists of consecutive numbers at end of each col name, which we extract by usingrsplit
.
res = pd.DataFrame(np.hstack((correlation,p_value)), columns=cols
+[f'p_value_{i+1}' for i in range(n)], index=cols)
res = res.loc[:,sorted(res.columns,
key=lambda x: int(x.rsplit('_', maxsplit=1)[1]))]
print(res)
Col_1 p_value_1 Col_2 p_value_2 Col_3 p_value_3
Col_1 1.000000 0.000000 0.410391 4.925358e-01 0.648886 0.236154
Col_2 0.410391 0.492536 1.000000 1.404265e-24 -0.263523 0.668397
Col_3 0.648886 0.236154 -0.263523 6.683968e-01 1.000000 0.000000
So, the only variable here is n
. E.g. for n = 5
, it will produce:
Col_1 p_value_1 Col_2 ... p_value_4 Col_5 p_value_5
Col_1 1.000000 1.404265e-24 -0.461690 ... 0.552815 0.307794 0.614384
Col_2 -0.461690 4.337662e-01 1.000000 ... 0.334035 -0.763158 0.133339
Col_3 0.666886 2.188940e-01 -0.263158 ... 0.362245 0.500000 0.391002
Col_4 0.359092 5.528147e-01 0.552632 ... 0.000000 -0.157895 0.799801
Col_5 0.307794 6.143840e-01 -0.763158 ... 0.799801 1.000000 0.000000
Answered By - ouroboros1
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.