Wednesday, December 13, 2023

[FIXED] Tensor (5-dim matrix) in python pandas?

December 13, 2023 matrix, numpy, pandas, python, tensor No comments

Issue

I have a CSV file that looks like

time	Col_A	Col_B	Col_C	Col_D	Price
123	A1	B1	C2	D6	23.43
124	A5	B3	C7	D1	14.63
125	A3	B2	C3	D2	343.43
126	A2	B1	C2	D6	43.43
127	A1	B1	C7	D2	6.63

Now I want to create a 5-dimensional matrix. I think it is called a tensor?

Question 1) Can this tensor be done in Pandas, or is there a better Python library? Do you have a link to an API manual so I can read more?

The dimensions of the tensor will be "time", "Col_A", "Col_B", "Col_C", "Col_D". The cell will contain the scalar "Price".

So I want to be able to set the cell, maybe something like:

my_matrix[time=123, Col_A = A1, Col_B = B1, Col_C = C2, Col_D = D6] = 23.43

Question 2) So what is the syntax to set a scalar? And how do I read the scalar?

I would also like to sum over dimensions. Say I want to sum like this below. When I write "*" I mean the star is a wildcard.

Matrix[time = *, Col_A = A1, Col_B = B1 and B2 and B3, Col_C = *, Col_D = D6 and D2]

Question 3) How do I sum over different dimensions? Do I need to for loop? Can I use other operations, such as divisions?

Actually, I want to put several numbers in each cell. Maybe I would like to put "time" and "Price" into each cell. I was thinking about string concatenation "time_Price", but to do a large summation, there will be lot of substring extraction, which might be time consuming? Therefore, I was thinking of creating several identical tensors. One tensor might contain the "Price" in each cell, and another tensor might contain "time" in each cell. So if I want to check the time for a Price, I will use the same coordinates in both tensors.

Question 4) I guess it is faster to use several tensors where each cell contain a single value, instead of using one single tensor containing many variables i.e. long string of concatenated variables? Or is there a better way?

Solution

It seems that a dataframe is already an appropriate structure for your data. To answer your questions I'll use your sample data, indexed by the Col_* columns:

# set up df
df = pd.DataFrame({
    'time': [123, 124, 125, 126, 127],
    'Col_A': ['A1', 'A5', 'A3', 'A2', 'A1'],
    'Col_B': ['B1', 'B3', 'B2', 'B1', 'B1'],
    'Col_C': ['C2', 'C7', 'C3', 'C2', 'C7'],
    'Col_D': ['D6', 'D1', 'D2', 'D6', 'D2'],
    'Price': [23.43, 14.63, 343.43, 43.43, 6.63]
})

# set the index to the `Col_*` columns and sort for performance
df = df.set_index(['Col_A','Col_B','Col_C','Col_D']).sort_index()

Now we have a dataframe where the Col_* values are the indexes and the time and Price values are in the "cells". This is basically the answer to Question 1.

                         time   Price
Col_A Col_B Col_C Col_D
A1    B1    C2    D6      123   23.43
            C7    D2      127    6.63
A2    B1    C2    D6      126   43.43
A3    B2    C3    D2      125  343.43
A5    B3    C7    D1      124   14.63

For Question 2 you can just use the index to set the "cell" value:

df.loc[('A1','B1','C2','D6'), 'Price'] = 55

or both "cells" at once:

df.loc[('A1','B1','C7','D2'), ['time', 'Price']] = [100, 99]

or with a condition:

df.loc[(df.index == ('A2','B1','C2','D6')) & (df['time'] == 126), 'Price'] = 44

Output:

                         time   Price
Col_A Col_B Col_C Col_D
A1    B1    C2    D6      123   55.00
            C7    D2      100   99.00
A2    B1    C2    D6      126   44.00
A3    B2    C3    D2      125  343.43
A5    B3    C7    D1      124   14.63

For Question 3, use indexing to access the values you want; with slice(None) for the wildcard:

df.loc[('A1', ['B1','B2','B3'], slice(None), ['D2','D6'])].sum()

Output:

time     223.0
Price    154.0
dtype: float64

In terms of Question 4, inside the dataframe you can have as many "cell" values as you like. As shown above, you can access the "cell" values individually or together as needed. For example, let's say you wanted address and gender information. Then you could set a few of those values like this:

# set an existing entry
df.loc[('A1','B1','C7','D2'), ['Address', 'Gender']] = ['1 Main St', 'male']

# create a new entry with all 4 values
# (time, price, address, gender)
df.loc[('A4', 'B4', 'C4', 'D4')] = [130, 45.3, 'High St', 'female']

# create a new entry with only 3 of the 4 values
df.loc[('A1', 'B2', 'C3', 'D4'), ['time', 'Price', 'Address']] = [120, 51.4, '4 Brown St']

Starting from the original dataframe, this gives:

              time   Price     Address  Gender
A1 B1 C2 D6  123.0   23.43         NaN     NaN
      C7 D2  127.0    6.63   1 Main St    male
A2 B1 C2 D6  126.0   43.43         NaN     NaN
A3 B2 C3 D2  125.0  343.43         NaN     NaN
A5 B3 C7 D1  124.0   14.63         NaN     NaN
A4 B4 C4 D4  130.0   45.30     High St  female
A1 B2 C3 D4  120.0   51.40  4 Brown St     NaN

Note that for performance reasons, you should use df = df.sort_index() after adding new entries.

You can also search this dataframe easily. For example, find all rows with Price > 50:

df[df['Price'] > 50]

Output:

              time   Price     Address Gender
A3 B2 C3 D2  125.0  343.43         NaN    NaN
A1 B2 C3 D4  120.0   51.40  4 Brown St    NaN

or time < 125 and no gender specified:

df[(df['time'] < 125) & df['Gender'].isna()]

Output:

              time  Price     Address Gender
A1 B1 C2 D6  123.0  23.43         NaN    NaN
A5 B3 C7 D1  124.0  14.63         NaN    NaN
A1 B2 C3 D4  120.0  51.40  4 Brown St    NaN

Answered By - Nick

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, December 13, 2023

[FIXED] Tensor (5-dim matrix) in python pandas?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels