Wednesday, October 5, 2022

[FIXED] scipy stats correlation to data frame with column names

October 05, 2022 dataframe, numpy, scipy No comments

Issue

When scipy.stats.spearmanr(df1,df2) is used on two data frames you can unstack the output values to get 2 numpy arrays of correlation and p-values. How do you put them back into a data frame with the same column names as the original data frames?

correlation, p_value = scipy.stats.spearmanr(df1, df2)

yields 2 numpy arrays but would like to get data frame with correlation and p_value next to each column for x columns as in:

not obvious...here is some code to get random data (suggested by ouroboros1)

import scipy.stats
import numpy as np
import pandas as pd
# Create array of 5 rows and 3 columns,
# filled with random values from 1 to 10 
data = np.random.randint(1,10,size=(5,3))
# Create Dataframe with data
df1 = pd.DataFrame(data, columns=['Col_1','Col_2','Col_3'])
# Set target column in new data frame
df2 = df1['Col_3'].to_frame().copy()
df1.drop(['Col_3'], axis=1, inplace=True)
# Obtain correlation coefficient for each 
# value in column   against value in every other column
correlation, p_value = scipy.stats.spearmanr(df1, df2)

(edited) I think I found a way but there must be a shorter path:

# Concat frames that contain features with frame that
# contains target
correlation_frame = pd.concat([df1, df2], axis=1)
col_cor_list = list(correlation_frame)
# Create new frame containing all correlation coefficients
cor = pd.DataFrame(correlation, columns=col_list)
cor
    Col_1     Col_2        Col_3
0   1.000000    -0.883883   -0.229416
1   -0.883883   1.000000    0.081111
2   -0.229416   0.081111    1.000000

# Get a dataframe for p-values
col_p_list = ['p_value' + str(x) for x in
              range(1,len(p_value)+1)]
p_frame = pd.DataFrame(p_value, columns=col_p_list)
p_frame

    p_value1        p_value2    p_value3
0   1.404265e-24    0.046662    0.710482
1   4.666188e-02    0.000000    0.896840
2   7.104817e-01    0.896840    0.000000

# Combine values by alternating column names
# so p-values are placed correctly
alter_names = (list(sum(zip(col_cor_list, col_p_list), ())))
final = cor.join(p_frame)
results = final[alter_names]
results

   Col_1            p_value1       Col_2    p_value2      Col_3 p_value3
0   1.000000    1.404265e-24    -0.883883   0.046662 -0.229416  0.710482
1   -0.883883   4.666188e-02    1.000000    0.000000    0.081111    0.896840
2   -0.229416   7.104817e-01    0.081111    0.896840    1.000000    0.000000

Solution

Suggested refactored version:

Get correlation and p_value

import scipy.stats
import numpy as np
import pandas as pd

n = 3
data = np.random.randint(1,10,size=(5,n))
# create cols `Col_1`, `_2` etc.
cols = [f'Col_{i+1}' for i in range(n)]
df1 = pd.DataFrame(data, columns=cols)

# create `df2` based on index, and drop
df2 = df1.iloc[:,-1].to_frame().copy()
df1 = df1.iloc[:,:-1]
correlation, p_value = scipy.stats.spearmanr(df1, df2)

Turn into df

Use np.hstack to stack correlation and p_value in sequence horizontally.
Use sorted to sort the columns. Key consists of consecutive numbers at end of each col name, which we extract by using rsplit.

res = pd.DataFrame(np.hstack((correlation,p_value)), columns=cols
                   +[f'p_value_{i+1}' for i in range(n)], index=cols)

res = res.loc[:,sorted(res.columns, 
                       key=lambda x: int(x.rsplit('_', maxsplit=1)[1]))]

print(res)

          Col_1  p_value_1     Col_2     p_value_2     Col_3  p_value_3
Col_1  1.000000   0.000000  0.410391  4.925358e-01  0.648886   0.236154
Col_2  0.410391   0.492536  1.000000  1.404265e-24 -0.263523   0.668397
Col_3  0.648886   0.236154 -0.263523  6.683968e-01  1.000000   0.000000

So, the only variable here is n. E.g. for n = 5, it will produce:

          Col_1     p_value_1     Col_2  ...  p_value_4     Col_5  p_value_5
Col_1  1.000000  1.404265e-24 -0.461690  ...   0.552815  0.307794   0.614384
Col_2 -0.461690  4.337662e-01  1.000000  ...   0.334035 -0.763158   0.133339
Col_3  0.666886  2.188940e-01 -0.263158  ...   0.362245  0.500000   0.391002
Col_4  0.359092  5.528147e-01  0.552632  ...   0.000000 -0.157895   0.799801
Col_5  0.307794  6.143840e-01 -0.763158  ...   0.799801  1.000000   0.000000

Answered By - ouroboros1

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, October 5, 2022

[FIXED] scipy stats correlation to data frame with column names

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels