Issue
I have a time series stored in a pandas DataFrame: the sequence {X_t}, where t is the time index, truncated in my case to t ∈ [0, T] with T > 0.

For each t, I would like to compute the series

    X̂_t = Σ_{k=0}^{∞} w_k · X_{t−k}

where the weights w_k are defined by the recursive relation

    w_0 = 1,    w_k = −w_{k−1} · (d − k + 1) / k

and d is a float parameter between 0 and 1.

Of course, given that my series is not infinite, this sum must be truncated: at time t only the terms up to k = t are available. For example, for the last two terms of the time series I will compute

    X̂_{T−1} = Σ_{k=0}^{T−1} w_k · X_{T−1−k},    X̂_T = Σ_{k=0}^{T} w_k · X_{T−k}
I have tried to implement this computation in Python, but it is terribly slow and I am sure I am not using the full potential of pandas/NumPy. Can anyone suggest a better way to do this computation?
First of all, I create a random dataset
import pandas as pd
import numpy as np
from tqdm import tqdm
df = pd.DataFrame(np.random.randint(0,100,size=(100000, 1)), columns=['value'])
then, I create a function that computes the weights iteratively
def get_next_weight(weight, k, d):
    return -weight * (d - k + 1) / k

weights = [1]
for idx in range(1, len(df)):
    weights.append(get_next_weight(weights[-1], idx, 0.1))
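(As a side note for readers: this recursion generates the binomial-series coefficients w_k = (−1)^k · C(d, k) of the fractional difference operator (1 − B)^d. A quick sanity check of the first few weights, written as a standalone sketch rather than part of the question's code:)

```python
import math
import numpy as np

d = 0.1

# Build the first few weights with the recursion from the question:
# w_0 = 1,  w_k = -w_{k-1} * (d - k + 1) / k
w = [1.0]
for k in range(1, 6):
    w.append(-w[-1] * (d - k + 1) / k)

# Closed form: w_k = (-1)^k * C(d, k), with the generalized binomial
# coefficient expressed via the gamma function
def binom(a, k):
    return math.gamma(a + 1) / (math.gamma(k + 1) * math.gamma(a - k + 1))

expected = [(-1) ** k * binom(d, k) for k in range(6)]
print(np.allclose(w, expected))  # True
```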
Then, I compute the new series
new_values = []
with tqdm(total=len(df)) as pbar:
    for idx in df.index:
        Xt = (df.value.loc[:idx].sort_index(ascending=False).reset_index(drop=True) * weights[:idx + 1]).sum()
        new_values.append(Xt)
        pbar.update(1)
This is really slow, and I know that my solution is very bad, but I couldn't come up with a better clean solution.
Any help?
Solution
2-stage optimization:
- weights: instead of looping, appending, and indexing, build a NumPy array of weights with a list comprehension, computing each weight on the fly
- "new" values: instead of locating and sorting the index of each consecutive slice of values, operate on the raw NumPy array, reversing each slice of values with [::-1]
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 1)), columns=['value'])

def get_next_weight(weight, k, d):
    return -weight * (d - k + 1) / k

w = 1
weights = np.array([w] + [(w := get_next_weight(w, i, 0.1)) for i in df.index[1:]])

new_values = [(df['value'].values[:i + 1][::-1] * weights[:i + 1]).sum()
              for i in df.index]
Your initial approach ran in about 6 minutes on my machine, while this one finishes in about 7 seconds.
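(Editorial note: since each new value is the truncated sum Σ_{k=0}^{t} w_k X_{t−k}, the whole output series is just the first T+1 terms of a discrete convolution of the values with the weights. A further sketch, not part of the answer above, that avoids the per-index loop entirely via `np.convolve`:)

```python
import numpy as np

def get_next_weight(weight, k, d):
    return -weight * (d - k + 1) / k

rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=1000).astype(float)

# Same weight construction as in the answer
w = 1.0
weights = np.array([w] + [(w := get_next_weight(w, i, 0.1)) for i in range(1, len(x))])

# new_values[t] = sum_{k=0..t} w_k * x[t-k] -- exactly the first len(x)
# terms of the full convolution of x with the weights
new_values = np.convolve(x, weights)[:len(x)]

# Cross-check against the per-index loop from the answer
loop_values = np.array([(x[:i + 1][::-1] * weights[:i + 1]).sum()
                        for i in range(len(x))])
print(np.allclose(new_values, loop_values))  # True
```

For very long series, `scipy.signal.fftconvolve` computes the same convolution in O(n log n) instead of O(n²).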
Sample result fragment:
print(new_values[:10])
[8.0, 6.2, 10.940000000000001, 1.2570000000000003, 31.795200000000005, 58.9494285, 49.66343665, 24.988227842500006, -6.947859277874997, 8.43163034610313]
Answered By - RomanPerekhrest