Issue
I have a time series stored in a pandas DataFrame: the sequence {X_t}, where t is the time index, truncated in my case to t ∈ [0, T] with T > 0.

For each t, I would like to compute the series

    X̂_t = Σ_{k=0}^{∞} w_k · X_{t−k}

where the weights w_k are defined by the recursive relation

    w_0 = 1,    w_k = −w_{k−1} · (d − k + 1) / k

and d is a float parameter between 0 and 1.

Of course, given that my series is not infinite, this sum must be truncated: at time t only the terms up to k = t are available. For example, for the last two terms of the time series I will compute

    X̂_{T−1} = Σ_{k=0}^{T−1} w_k · X_{T−1−k},    X̂_T = Σ_{k=0}^{T} w_k · X_{T−k}
I have tried to implement this computation in Python, but it is terribly slow and I am sure I am not using the full potential of pandas/NumPy. Can anyone suggest a better way to do this computation?
First of all, I create a random dataset
import pandas as pd
import numpy as np
from tqdm import tqdm
df = pd.DataFrame(np.random.randint(0,100,size=(100000, 1)), columns=['value'])
then, I create a function that computes the weights iteratively
def get_next_weight(weight, k, d):
    return -weight * (d - k + 1) / k

weights = [1]
for idx in range(1, len(df)):
    weights.append(get_next_weight(weights[-1], idx, 0.1))
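(As a side note for readers: this recursion generates the binomial-series coefficients w_k = (−1)^k · C(d, k) of the fractional difference operator (1 − B)^d. A quick sanity check of the first few weights, written as a standalone sketch rather than part of the question's code:)

```python
import math
import numpy as np

d = 0.1

# Build the first few weights with the recursion from the question:
# w_0 = 1,  w_k = -w_{k-1} * (d - k + 1) / k
w = [1.0]
for k in range(1, 6):
    w.append(-w[-1] * (d - k + 1) / k)

# Closed form: w_k = (-1)^k * C(d, k), with the generalized binomial
# coefficient expressed via the gamma function
def binom(a, k):
    return math.gamma(a + 1) / (math.gamma(k + 1) * math.gamma(a - k + 1))

expected = [(-1) ** k * binom(d, k) for k in range(6)]
print(np.allclose(w, expected))  # True
```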
Then, I compute the new series
new_values = []
with tqdm(total=len(df)) as pbar:
    for idx in df.index:
        Xt = (df.value.loc[:idx].sort_index(ascending=False).reset_index(drop=True) * weights[:idx + 1]).sum()
        new_values.append(Xt)
        pbar.update(1)
This is really slow, and I know that my solution is very bad, but I couldn't come up with a better clean solution.
Any help?
Solution
2-stage optimization:
- weights: instead of looping, appending, and indexing, build a NumPy array of weights with a list comprehension, computing each weight on the fly
- "new" values: instead of locating and sorting the index of each consecutive slice of values, operate on the raw NumPy array, reversing each slice of values with [::-1]
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 1)), columns=['value'])

def get_next_weight(weight, k, d):
    return -weight * (d - k + 1) / k

w = 1
weights = np.array([w] + [(w := get_next_weight(w, i, 0.1)) for i in df.index[1:]])

new_values = [(df['value'].values[:i + 1][::-1] * weights[:i + 1]).sum()
              for i in df.index]
Your initial approach ran in about 6 minutes on my machine, while this one finishes in about 7 seconds.
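(Editorial note: since each new value is the truncated sum Σ_{k=0}^{t} w_k X_{t−k}, the whole output series is just the first T+1 terms of a discrete convolution of the values with the weights. A further sketch, not part of the answer above, that avoids the per-index loop entirely via `np.convolve`:)

```python
import numpy as np

def get_next_weight(weight, k, d):
    return -weight * (d - k + 1) / k

rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=1000).astype(float)

# Same weight construction as in the answer
w = 1.0
weights = np.array([w] + [(w := get_next_weight(w, i, 0.1)) for i in range(1, len(x))])

# new_values[t] = sum_{k=0..t} w_k * x[t-k] -- exactly the first len(x)
# terms of the full convolution of x with the weights
new_values = np.convolve(x, weights)[:len(x)]

# Cross-check against the per-index loop from the answer
loop_values = np.array([(x[:i + 1][::-1] * weights[:i + 1]).sum()
                        for i in range(len(x))])
print(np.allclose(new_values, loop_values))  # True
```

For very long series, `scipy.signal.fftconvolve` computes the same convolution in O(n log n) instead of O(n²).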
Sample result fragment:
print(new_values[:10])
[8.0, 6.2, 10.940000000000001, 1.2570000000000003, 31.795200000000005, 58.9494285, 49.66343665, 24.988227842500006, -6.947859277874997, 8.43163034610313]
Answered By - RomanPerekhrest