Issue
I have a CSV file that looks like
time | Col_A | Col_B | Col_C | Col_D | Price |
---|---|---|---|---|---|
123 | A1 | B1 | C2 | D6 | 23.43 |
124 | A5 | B3 | C7 | D1 | 14.63 |
125 | A3 | B2 | C3 | D2 | 343.43 |
126 | A2 | B1 | C2 | D6 | 43.43 |
127 | A1 | B1 | C7 | D2 | 6.63 |
Now I want to create a 5-dimensional matrix. I think it is called a tensor?
Question 1) Can this tensor be done in Pandas, or is there a better Python library? Do you have a link to an API manual so I can read more?
The dimensions of the tensor will be "time", "Col_A", "Col_B", "Col_C", "Col_D". The cell will contain the scalar "Price".
So I want to be able to set the cell, maybe something like:
my_matrix[time=123, Col_A = A1, Col_B = B1, Col_C = C2, Col_D = D6] = 23.43
Question 2) So what is the syntax to set a scalar? And how do I read the scalar?
I would also like to sum over dimensions. Say I want to sum like this below. When I write "*" I mean the star is a wildcard.
Matrix[time = *, Col_A = A1, Col_B = B1 and B2 and B3, Col_C = *, Col_D = D6 and D2]
Question 3) How do I sum over different dimensions? Do I need to for loop? Can I use other operations, such as divisions?
Actually, I want to put several numbers in each cell. Maybe I would like to put "time" and "Price" into each cell. I was thinking about string concatenation "time_Price", but to do a large summation, there will be lot of substring extraction, which might be time consuming? Therefore, I was thinking of creating several identical tensors. One tensor might contain the "Price" in each cell, and another tensor might contain "time" in each cell. So if I want to check the time for a Price, I will use the same coordinates in both tensors.
Question 4) I guess it is faster to use several tensors where each cell contain a single value, instead of using one single tensor containing many variables i.e. long string of concatenated variables? Or is there a better way?
Solution
It seems that a dataframe is already an appropriate structure for your data. To answer your questions I'll use your sample data, indexed by the Col_*
columns:
# set up df
df = pd.DataFrame({
'time': [123, 124, 125, 126, 127],
'Col_A': ['A1', 'A5', 'A3', 'A2', 'A1'],
'Col_B': ['B1', 'B3', 'B2', 'B1', 'B1'],
'Col_C': ['C2', 'C7', 'C3', 'C2', 'C7'],
'Col_D': ['D6', 'D1', 'D2', 'D6', 'D2'],
'Price': [23.43, 14.63, 343.43, 43.43, 6.63]
})
# set the index to the `Col_*` columns and sort for performance
df = df.set_index(['Col_A','Col_B','Col_C','Col_D']).sort_index()
Now we have a dataframe where the Col_*
values are the indexes
and the time
and Price
values are in the "cells".
This is basically the answer to Question 1.
time Price
Col_A Col_B Col_C Col_D
A1 B1 C2 D6 123 23.43
C7 D2 127 6.63
A2 B1 C2 D6 126 43.43
A3 B2 C3 D2 125 343.43
A5 B3 C7 D1 124 14.63
For Question 2 you can just use the index to set the "cell" value:
df.loc[('A1','B1','C2','D6'), 'Price'] = 55
or both "cells" at once:
df.loc[('A1','B1','C7','D2'), ['time', 'Price']] = [100, 99]
or with a condition:
df.loc[(df.index == ('A2','B1','C2','D6')) & (df['time'] == 126), 'Price'] = 44
Output:
time Price
Col_A Col_B Col_C Col_D
A1 B1 C2 D6 123 55.00
C7 D2 100 99.00
A2 B1 C2 D6 126 44.00
A3 B2 C3 D2 125 343.43
A5 B3 C7 D1 124 14.63
For Question 3, use indexing to access the values you want; with slice(None)
for the wildcard:
df.loc[('A1', ['B1','B2','B3'], slice(None), ['D2','D6'])].sum()
Output:
time 223.0
Price 154.0
dtype: float64
In terms of Question 4, inside the dataframe you can have as many "cell" values as you like. As shown above, you can access the "cell" values individually or together as needed. For example, let's say you wanted address and gender information. Then you could set a few of those values like this:
# set an existing entry
df.loc[('A1','B1','C7','D2'), ['Address', 'Gender']] = ['1 Main St', 'male']
# create a new entry with all 4 values
# (time, price, address, gender)
df.loc[('A4', 'B4', 'C4', 'D4')] = [130, 45.3, 'High St', 'female']
# create a new entry with only 3 of the 4 values
df.loc[('A1', 'B2', 'C3', 'D4'), ['time', 'Price', 'Address']] = [120, 51.4, '4 Brown St']
Starting from the original dataframe, this gives:
time Price Address Gender
A1 B1 C2 D6 123.0 23.43 NaN NaN
C7 D2 127.0 6.63 1 Main St male
A2 B1 C2 D6 126.0 43.43 NaN NaN
A3 B2 C3 D2 125.0 343.43 NaN NaN
A5 B3 C7 D1 124.0 14.63 NaN NaN
A4 B4 C4 D4 130.0 45.30 High St female
A1 B2 C3 D4 120.0 51.40 4 Brown St NaN
Note that for performance reasons, you should use df = df.sort_index()
after adding new entries.
You can also search this dataframe easily. For example, find all rows with Price > 50
:
df[df['Price'] > 50]
Output:
time Price Address Gender
A3 B2 C3 D2 125.0 343.43 NaN NaN
A1 B2 C3 D4 120.0 51.40 4 Brown St NaN
or time < 125
and no gender specified:
df[(df['time'] < 125) & df['Gender'].isna()]
Output:
time Price Address Gender
A1 B1 C2 D6 123.0 23.43 NaN NaN
A5 B3 C7 D1 124.0 14.63 NaN NaN
A1 B2 C3 D4 120.0 51.40 4 Brown St NaN
Answered By - Nick
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.