Issue
I have a dataframe, and since I have to perform many calculations on it I figured I'd give NumPy a try, so I'm just learning how to use it. This is my dataframe:
df = pd.DataFrame({'col1': ['z', 'x', 'c', 'v', 'b', 'n'], 'col2': [100, 200, 300, 400, 500, 600]})
df1 = pd.DataFrame({'col1': ['z', 'x', 'c', 'v', 'b', 'n'], 'col2': [100, 212, 300, 405, 552, 641]})
df['col3'] = np.empty((len(df), 0)).tolist()
df1['col3'] = np.empty((len(df), 0)).tolist()
df2 = df.merge(df1, on='col1', how='outer')
Now what I want to do is append col2_y - col2_x - sum(col3_y) to column col3_y if col2_y != col2_x. I tried this:
df2 = df2.to_numpy()
df = [df2[x, 3:4] - df2[x, 1:2] for x in np.ndindex(len(df2))]
df2 = [np.where(df2[x, 1:2] != df2[x, 3:4],
np.append(df2[x, 4:5], (df2[x, 3:4] - df2[x, 1:2]) - (df2[x, 4:5].sum())),
df2[x, 4:5]) for x in np.ndindex(len(df2))]
but somehow from this
[['z' 100 list([]) 100 list([])]
['x' 200 list([]) 212 list([])]
['c' 300 list([]) 300 list([])]
['v' 400 list([]) 405 list([])]
['b' 500 list([]) 552 list([])]
['n' 600 list([]) 641 list([])]]
It's turning into this
[array([[0]], dtype=object),
array([[12]],dtype=object),
array([[0]],dtype=object),
array([[5]], dtype=object),
array([[52]], dtype=object),
array([[41]], dtype=object)]
[array([[list([])]], dtype=object),
array([[list([])]], dtype=object),
array([[list([])]], dtype=object),
array([[list([])]], dtype=object),
array([[list([])]], dtype=object),
array([[list([])]], dtype=object)]
Am I not using np.ndindex correctly? Is the slicing correct at least?
Do I even need it or is there a better way to accomplish what I'm trying to do?
I appreciate any suggestions!
Solution
Your dataframe:
In [43]: df2
Out[43]:
  col1  col2_x col3_x  col2_y col3_y
0    z     100     []     100     []
1    x     200     []     212     []
2    c     300     []     300     []
3    v     400     []     405     []
4    b     500     []     552     []
5    n     600     []     641     []
and the array derived from it (note the object dtype):
In [44]: arr = df2.to_numpy()
In [45]: arr
Out[45]:
array([['z', 100, list([]), 100, list([])],
['x', 200, list([]), 212, list([])],
['c', 300, list([]), 300, list([])],
['v', 400, list([]), 405, list([])],
['b', 500, list([]), 552, list([])],
['n', 600, list([]), 641, list([])]], dtype=object)
Your iterative difference works, but the result is actually a Python list of small object-dtype arrays, not an array:
In [46]: arr1 = [arr[x, 3:4] - arr[x, 1:2] for x in np.ndindex(len(arr))]
In [47]: arr1
Out[47]:
[array([[0]], dtype=object),
array([[12]], dtype=object),
array([[0]], dtype=object),
array([[5]], dtype=object),
array([[52]], dtype=object),
array([[41]], dtype=object)]
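The nested [[0]] shape in that list comes from how np.ndindex works: it yields index tuples such as (0,), not plain integers, and a tuple used as a row index triggers advanced indexing, which adds a dimension. A minimal sketch on a cut-down array (the two-row data here is made up for illustration):

```python
import numpy as np

# Cut-down stand-in for the object array from df2.to_numpy()
arr = np.array([['z', 100, 100],
                ['x', 200, 212]], dtype=object)

# np.ndindex(n) yields 1-tuples, not integers
idx = list(np.ndindex(len(arr)))            # [(0,), (1,)]

# A tuple as the row index is treated like an index array -> extra dimension
via_ndindex = arr[idx[1], 2:3] - arr[idx[1], 1:2]

# A plain integer row index keeps the result one-dimensional
via_int = arr[1, 2:3] - arr[1, 1:2]

print(via_ndindex.shape, via_int.shape)
```

So np.ndindex is not wrong here, just unnecessary for a 1-D loop; range(len(arr)), or iterating over the rows directly, gives the simpler shape.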
The same difference computed directly on the Series:
In [48]: df2['col2_y']-df2['col2_x']
Out[48]:
0 0
1 12
2 0
3 5
4 52
5 41
dtype: int64
and the array column difference, without iteration. Object dtype math is still slower than numeric:
In [50]: arr[:,3]-arr[:,1]
Out[50]: array([0, 12, 0, 5, 52, 41], dtype=object)
A numpy integer dtype version:
In [51]: df2['col2_y'].to_numpy()-df2['col2_x'].to_numpy()
Out[51]: array([ 0, 12, 0, 5, 52, 41])
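To make the dtype point concrete, here is a small sketch, assuming a reduced frame with just the label and the two numeric columns. Converting the whole frame gives an object array, while selecting only the numeric columns first gives an integer array that supports fast whole-column arithmetic:

```python
import numpy as np
import pandas as pd

# Reduced version of df2: a string column plus the two numeric columns
df2 = pd.DataFrame({'col1': ['z', 'x', 'c'],
                    'col2_x': [100, 200, 300],
                    'col2_y': [100, 212, 300]})

whole = df2.to_numpy()                        # mixed columns -> object dtype
nums = df2[['col2_x', 'col2_y']].to_numpy()   # numeric columns only -> integer dtype

diff = nums[:, 1] - nums[:, 0]                # vectorized column difference
changed = diff != 0                           # boolean mask of rows that differ

print(whole.dtype, nums.dtype)
print(diff, changed)
```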
I'm not sure I want to tackle the following line:
[np.where(df2[x, 1:2] != df2[x, 3:4],
np.append(df2[x, 4:5], (df2[x, 3:4] - df2[x, 1:2]) - (df2[x, 4:5].sum())),
df2[x, 4:5]) for x in np.ndindex(len(df2))]
It can be cleaned up with:
[np.where(x[1] != x[3],
np.append(x[4], (x[3] - x[1]) - sum(x[4])),
x[4])
for x in arr]
Since all the x[4] values are empty lists, this produces:
[array([], dtype=float64),
...
array([], dtype=float64)]
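The results are empty even for rows where the values differ. One likely reason: np.where is not a lazy if/else. It evaluates both branches and then broadcasts them against each other, and a length-1 array broadcast against a length-0 array gives length 0. A minimal sketch of that effect, with the value 12 chosen only for illustration:

```python
import numpy as np

appended = np.append([], 12)     # the "true" branch: array([12.])
untouched = np.array([])         # the "false" branch: empty, shape (0,)

# Both branches are computed, then broadcast together: (1,) with (0,) -> (0,)
picked = np.where(True, appended, untouched)

# A plain Python conditional chooses one object and does no broadcasting
plain = appended if True else untouched

print(picked.shape, plain.shape)
```

For a per-row decision on scalars, an ordinary if/else expression is usually the clearer tool; np.where is meant for elementwise selection between arrays of compatible shape.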
Oops, somewhere in my fiddling I've added values to the lists in the last column:
In [65]: df2
Out[65]:
  col1  col2_x col3_x  col2_y col3_y
0    z     100     []     100    [0]
1    x     200     []     212   [12]
2    c     300     []     300    [0]
3    v     400     []     405    [5]
4    b     500     []     552   [52]
5    n     600     []     641   [41]
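Because col3_y holds Python lists, NumPy has little to offer for the append step itself; a plain loop over the rows may be the simplest route. The sketch below follows the rule as stated in the question (append only when col2_y != col2_x), so unchanged rows keep an empty list, unlike the output above. The helper name append_change and the three-row frame are assumptions made for illustration:

```python
import pandas as pd

# Three-row stand-in for the merged frame
df2 = pd.DataFrame({'col1': ['z', 'x', 'c'],
                    'col2_x': [100, 200, 300],
                    'col2_y': [100, 212, 300],
                    'col3_y': [[], [], []]})

def append_change(old, new, history):
    """Hypothetical helper: return history with (new - old - sum(history)) appended
    when the two values differ; otherwise return history unchanged."""
    if new == old:
        return history
    return history + [int(new - old) - sum(history)]

# Build the new column row by row; each cell stays an ordinary Python list
df2['col3_y'] = [append_change(o, n, h)
                 for o, n, h in zip(df2['col2_x'], df2['col2_y'], df2['col3_y'])]

print(df2)
```

If the lists are not needed, keeping the differences as a numeric column and using the vectorized subtraction shown earlier avoids the Python-level loop entirely.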
Answered By - hpaulj