Saturday, December 23, 2023

[FIXED] select data with np.where with multiple index

Labels: dataframe, numpy, pandas, python

Issue

Update: I tried Valdi_Bo's code with a small modification:

df = df.dropna()
print(df.head(5))
a = df.to_numpy()
row, col = np.where(np.greater(a, 0.8) & np.less(a, 1.0))
selected_corr_df = pd.DataFrame([ (df.index[r][0], df.index[r][1], df.columns[c], df.iloc[r, c])
                                 for r, c in np.c_[row, col] ], columns = ['datetime','code', 'col', 'val'])

so that datetime is included. However, the result is:

             datetime     code       col       val
0 2023-01-01 20:00:00  BTCUSDT   BTCUSDT  1.000000
1 2023-01-01 20:00:00  BTCUSDT   ETHUSDT  0.839907
2 2023-01-01 20:00:00  ETHUSDT   BTCUSDT  0.839907
3 2023-01-01 20:00:00  BNBUSDT   BNBUSDT  1.000000
4 2023-01-01 20:00:00  XRPUSDT  DOGEUSDT  0.804122

I don't know why rows with a correlation of 1.0 are still included. I tried printing row and col:

print(row,col)
[    0     0     1 ... 65307 65314 65317] [0 1 0 ... 7 4 7]
print(df.head(5).iloc[:,:5])
                              BTCUSDT   ETHUSDT   BNBUSDT   XRPUSDT   SOLUSDT
datetime            code                                                     
2023-01-01 20:00:00 BTCUSDT  1.000000  0.839907  0.628913  0.459101  0.685730
                    ETHUSDT  0.839907  1.000000  0.654618  0.401911  0.572358
                    BNBUSDT  0.628913  0.654618  1.000000  0.494672  0.561099
                    XRPUSDT  0.459101  0.401911  0.494672  1.000000  0.525257
                    SOLUSDT  0.685730  0.572358  0.561099  0.525257  1.000000

It looks like something goes wrong in the np.where call.

I have the following DataFrame, corr_df:

                                BTCUSDT   ETHUSDT   BNBUSDT
datetime            code                                   
2023-09-30 22:00:00 BTCUSDT    1.000000  0.744847  0.758208
                    ETHUSDT    0.744847  1.000000  0.788360
                    BNBUSDT    0.758208  0.788360  1.000000
                    XRPUSDT    0.175308  0.165487  0.330017
                    SOLUSDT    0.392990  0.433611  0.573683
                    ADAUSDT    0.326387  0.465547  0.555164
                    DOGEUSDT   0.584677  0.572698  0.658798
                    TRXUSDT    0.396398  0.150638  0.330389
                    MATICUSDT  0.640225  0.603454  0.663140
                    DOTUSDT    0.741107  0.758502  0.844142
2023-09-30 23:00:00 BTCUSDT    1.000000  0.739415  0.775631
                    ETHUSDT    0.739415  1.000000  0.798653
                    BNBUSDT    0.775631  0.798653  1.000000
                    XRPUSDT    0.185117  0.172995  0.334012
                    SOLUSDT    0.407861  0.444603  0.579565
                    ADAUSDT    0.324440  0.462648  0.543445
                    DOGEUSDT   0.593054  0.577852  0.667520
                    TRXUSDT    0.414259  0.164620  0.344475
                    MATICUSDT  0.655499  0.613226  0.670288
                    DOTUSDT    0.744608  0.759279  0.846514

I want to get the pair names that meet the condition, with datetime as the index.

I tried to select the data with np.where. The condition is:

np.where(corr_dict[corr_len]> 0.8) & (corr_dict[corr_len] != 1)

which outputs two arrays, so I used two variables to store them:

row, col = np.where(corr_dict[corr_len]> 0.8) & (corr_dict[corr_len] != 1)

However, I got stuck when I tried to create a new DataFrame named selected_corr_df. I tried to use the column name as the 'pair1' column and the second index level (code) as the 'pair2' column:

selected_corr_df = pd.DataFrame(index=corr_dict[corr_len].index, columns=['pair1', 'pair2'])

But this only produces errors.

Solution

To create your source DataFrame I used the following code:

import io

import numpy as np
import pandas as pd

txt = '''\
datetime            code       BTCUSDT   ETHUSDT   BNBUSDT
2023-09-30 22:00:00 BTCUSDT    1.000000  0.744847  0.758208
2023-09-30 22:00:00 ETHUSDT    0.744847  1.000000  0.788360
2023-09-30 22:00:00 BNBUSDT    0.758208  0.788360  1.000000
2023-09-30 22:00:00 XRPUSDT    0.175308  0.165487  0.330017
2023-09-30 22:00:00 SOLUSDT    0.392990  0.433611  0.573683
2023-09-30 22:00:00 ADAUSDT    0.326387  0.465547  0.555164
2023-09-30 22:00:00 DOGEUSDT   0.584677  0.572698  0.658798
2023-09-30 22:00:00 TRXUSDT    0.396398  0.150638  0.330389
2023-09-30 22:00:00 MATICUSDT  0.640225  0.603454  0.663140
2023-09-30 22:00:00 DOTUSDT    0.741107  0.758502  0.844142
2023-09-30 23:00:00 BTCUSDT    1.000000  0.739415  0.775631
2023-09-30 23:00:00 ETHUSDT    0.739415  1.000000  0.798653
2023-09-30 23:00:00 BNBUSDT    0.775631  0.798653  1.000000
2023-09-30 23:00:00 XRPUSDT    0.185117  0.172995  0.334012
2023-09-30 23:00:00 SOLUSDT    0.407861  0.444603  0.579565
2023-09-30 23:00:00 ADAUSDT    0.324440  0.462648  0.543445
2023-09-30 23:00:00 DOGEUSDT   0.593054  0.577852  0.667520
2023-09-30 23:00:00 TRXUSDT    0.414259  0.164620  0.344475
2023-09-30 23:00:00 MATICUSDT  0.655499  0.613226  0.670288
2023-09-30 23:00:00 DOTUSDT    0.744608  0.759279  0.846514
'''
df = pd.read_fwf(io.StringIO(txt), widths=[20, 11, 10, 10, 10])
df.datetime = pd.to_datetime(df.datetime)
df.set_index(['datetime', 'code'], inplace=True)

Then I created a NumPy array:

a = df.to_numpy()

To get row / column numbers of elements of interest, I ran:

row, col = np.where(np.greater(a, 0.8) & np.less(a, 1.0))
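The key detail in the line above is that both comparisons are combined into a single boolean mask inside the np.where call. In the question's attempt, the closing parenthesis came after the first condition, so the & was applied to the tuple that np.where returns, which is a likely source of the errors. A minimal sketch on a made-up array (the values are illustrative, not taken from the post):

```python
import numpy as np

# Illustrative values only; the first row mimics a diagonal entry of 1.0.
a = np.array([[1.0, 0.85, 0.30],
              [0.85, 1.0, 0.95]])

# Build the full mask first, then ask np.where for the True positions.
mask = (a > 0.8) & (a < 1.0)
row, col = np.where(mask)

print(row, col)  # [0 1 1] [1 0 2]
```

Each (row[i], col[i]) pair is the position of one element that satisfies both conditions.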

And finally, to get the result with three columns (code, col and val), I ran:

selected_corr_df = pd.DataFrame([ (df.index[r][1], df.columns[c], df.iloc[r, c])
    for r, c in np.c_[row, col] ], columns = ['code', 'col', 'val'])

The result is:

      code      col       val
0  DOTUSDT  BNBUSDT  0.844142
1  DOTUSDT  BNBUSDT  0.846514
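As a follow-up sketch that is not part of the original answer: the update in the question wants datetime in the result and still sees rows displayed as 1.000000. One plausible cause, which the thread does not confirm, is that some diagonal values are fractionally below 1.0 and only print as 1.000000, so they pass the "less than 1.0" test. Under that assumption, dropping self-pairs by label (code equal to col) instead of by value sidesteps the problem. The labels and values below are made up for illustration, and melt is used here as a substitute for the np.where loop:

```python
import numpy as np
import pandas as pd

# Illustrative data only (labels and values are made up, not taken from the post).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2023-01-01 20:00:00"]), ["AAA", "BBB", "CCC"]],
    names=["datetime", "code"],
)
df = pd.DataFrame(
    [[0.9999999999999999, 0.85, 0.30],
     [0.85, 1.0, 0.95],
     [0.30, 0.95, 1.0]],
    index=idx,
    columns=["AAA", "BBB", "CCC"],
)

a = df.to_numpy()
# A value just below 1.0 passes the "< 1.0" test even though it displays as 1.000000.
print(np.less(a, 1.0)[0, 0])  # True

# Reshape to long format: one row per (datetime, code, col) combination.
long_df = df.reset_index().melt(
    id_vars=["datetime", "code"], var_name="col", value_name="val"
)

# Keep strong correlations, dropping self-pairs by label rather than by value.
mask = (long_df["val"] > 0.8) & (long_df["code"] != long_df["col"])
selected_corr_df = long_df[mask].reset_index(drop=True)
print(selected_corr_df)
```

Because a correlation matrix is symmetric, each pair still appears twice (for example AAA/BBB and BBB/AAA); filtering on code < col instead of code != col would keep one row per pair.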

Answered By - Valdi_Bo

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
