Issue
I need to assign a stack of rows from a pd.DataFrame by index to a matrix of known size as efficiently as possible. Many of the indices will not exist in the dataframe, and these rows must be filled with NaNs. This operation will happen inside a loop iterating over each row in the dataframe and must be as fast as possible.
In short, I want to speed up the following sequence:
import numpy as np
import pandas as pd

# DF to iterate over
df = pd.DataFrame({'a': [1,2,3,4], 'b': [2,4,6,8], 'c': [3,6,9,12]})
# (Relative) DF indices to return if possible
indices = [-2, 0, 2, 4]
len_df = len(df)
len_indices = len(indices)
len_cols = len(df.columns)
values = np.zeros((len_df, len_indices, len_cols))
for i in range(len_df):
    for j, n in enumerate(indices):
        idx = i + n
        try:
            assert idx >= 0  # avoid wrapping around
            values[i, j] = df.iloc[idx]
        except (AssertionError, IndexError):
            values[i, j] = np.nan
values
Returns
[[[nan, nan, nan],
  [  1,   2,   3],
  [  3,   6,   9],
  [nan, nan, nan]],

 [[nan, nan, nan],
  [  2,   4,   6],
  [  4,   8,  12],
  [nan, nan, nan]],

 [[  1,   2,   3],
  [  3,   6,   9],
  [nan, nan, nan],
  [nan, nan, nan]],

 [[  2,   4,   6],
  [  4,   8,  12],
  [nan, nan, nan],
  [nan, nan, nan]]]
This is exactly what I want, but it is quite slow. Any suggestions?
Solution
Here's a solution that builds a lookup table:
# repeat indices into new dimension
rep_ids = np.tile(indices, (len(df), 1))
# add row index to the tiled ids
range_ids = rep_ids + np.arange(len(df))[:, np.newaxis]
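# for the example df and indices above, range_ids is:
# [[-2  0  2  4]
#  [-1  1  3  5]
#  [ 0  2  4  6]
#  [ 1  3  5  7]]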
# check which ids are valid lookups into the df
# (valid means 0 <= idx < len(df), matching the assert in the original loop)
valid_test = (range_ids >= 0) & (range_ids < len(df))
# lookup table, using 0 for invalid ids (those entries are overwritten later)
lookup_ids = np.where(valid_test, range_ids, 0)
# look up all values at once; cast to float so NaN can be stored
values = df.values[lookup_ids].astype(np.float64)
# set the invalid positions to NaN
values[~valid_test] = np.nan
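As a quick sanity check (not part of the original answer), you can keep the loop result from the question around under a separate, hypothetical name such as values_loop and confirm that the vectorized result matches it, using a NaN-aware comparison:

# assumes the loop result from the question was saved as `values_loop`
# (hypothetical name) before running the vectorized code, which produces `values`
assert values.shape == values_loop.shape
# equal_nan=True treats matching NaN positions as equal
assert np.allclose(values, values_loop, equal_nan=True)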
Answered By - w-m