Issue
I have a DataFrame with roughly 500 columns. I can easily compute a correlation matrix on this data using df.corr(). However, I would like to know more than just the correlation for each pairwise column comparison. Specifically, I want to know how many rows in my source DataFrame had non-null records for both columns in the pair, so that I know the number of records used to calculate each correlation. The DataFrame is sparse enough that some correlations may be calculated over only a few non-null values while others may have thousands of contributing data points.
Example Table:
Index | feature 1 | feature 2 | feature 3 |
---|---|---|---|
a | | 2 | 3 |
b | 7 | | 2 |
c | 1 | | |
d | 3 | 1 | 9 |
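For concreteness, the example table can be reconstructed as a small DataFrame (a sketch; NaN stands in for the blank cells above):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example table; NaN marks the blank cells.
df = pd.DataFrame(
    {
        "feature 1": [np.nan, 7, 1, 3],
        "feature 2": [2, np.nan, np.nan, 1],
        "feature 3": [3, 2, np.nan, 9],
    },
    index=["a", "b", "c", "d"],
)
```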
Desired Output:
-- | feature 1 | feature 2 | feature 3 |
---|---|---|---|
feature 1 | 3 | 1 (*) | 2 |
feature 2 | 1 (*) | 2 | 2 (**) |
feature 3 | 2 | 2 (**) | 3 |
This desired output shows me that the combination of 'feature 1' and 'feature 2' has only one row (*) where both have non-null values, but the combination of 'feature 2' and 'feature 3' has two rows (**) where both have non-null values.
I'm thinking some combination of df.melt and df.pivot could accomplish this, but I haven't found the right approach yet. I really don't want to iterate: I tried looping over the column pairs, but with 500+ columns the number of pairwise combinations means it is never going to be fast enough (roughly the approach sketched below).
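(For reference, the kind of pairwise loop being ruled out looks roughly like this; pairwise_counts_slow is an illustrative name, not code from the question.)

```python
from itertools import combinations

import pandas as pd

def pairwise_counts_slow(df: pd.DataFrame) -> pd.DataFrame:
    # O(n^2) over columns: with 500 columns this is ~125,000 pairs,
    # each requiring its own row-wise null check.
    cols = df.columns
    out = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for c1, c2 in combinations(cols, 2):
        n = df[[c1, c2]].dropna().shape[0]  # rows where both columns are non-null
        out.loc[c1, c2] = out.loc[c2, c1] = n
    for c in cols:
        out.loc[c, c] = df[c].notna().sum()  # diagonal: non-null count per column
    return out
```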
Solution
Just do a dot product on the non-null indicator of the features DataFrame, as shown below:
```python
a = 0 + ~df.isna()  # boolean mask of non-null cells, converted to 0/1
a.T @ a             # (columns x rows) @ (rows x columns) -> pairwise counts
```
```
           feature 1  feature 2  feature 3
feature 1          3          1          2
feature 2          1          2          2
feature 3          2          2          3
```
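An equivalent spelling builds the indicator matrix with notna, which returns the same boolean mask as ~df.isna(); astype(int) just makes the 0/1 conversion explicit:

```python
v = df.notna().astype(int)  # 1 where a value is present, 0 where it is NaN
counts = v.T.dot(v)         # same pairwise non-null counts as above
```

Either way, the diagonal holds each column's non-null count, and each off-diagonal entry counts the rows where both columns are non-null, which is exactly the number of records pandas has available when computing that pair's correlation.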
Answered By - Onyambu