Saturday, July 16, 2022

[FIXED] Aligning batched sliding frame timeseries data for tensorflow/keras using timeseries_dataset_from_array and TimeseriesGenerator respectively

keras, machine-learning, python, tensorflow, training-data

Issue

I have multiple input features and a single target feature that correspond 1:1 by index, meaning there should be no looking forward or backward when comparing inputs to targets: input[t] <=> target[t]. Essentially, I have already time-shifted my targets backwards to their corresponding input indexes for training purposes.

Under normal operation, I would use N periods' worth of past data to predict one future value, N periods ahead. As the frame shifts forward in time, each slot is filled with the [t+N] forecast, recorded at [t].

Depending on the environment I'm developing in, I need to use either timeseries_dataset_from_array or TimeseriesGenerator to batch my data (whichever the system supports). I need to know whether my implementation produces batches that do what I expect when running model.fit() in Keras. I'm unsure whether Keras internally shifts data during fitting in a way I'm unaware of, which might lead to poor results.
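For reference, the two utilities pair windows with targets differently, which is why the inputs or targets need a different adjustment for each. The sketch below is a plain-Python emulation of that pairing, assuming the default stride and sampling rate of 1. The names tdfa_pairs and tsgen_pairs are hypothetical helpers for illustration, not Keras APIs, and the emulation ignores batching and shuffling.

```python
def tdfa_pairs(data, targets, sequence_length):
    """Emulate timeseries_dataset_from_array (stride 1, sampling rate 1):
    window data[i : i + sequence_length] is paired with targets[i]."""
    n_windows = min(len(data) - sequence_length + 1, len(targets))
    return [(data[i:i + sequence_length], targets[i]) for i in range(n_windows)]


def tsgen_pairs(data, targets, length):
    """Emulate TimeseriesGenerator (stride 1, sampling rate 1):
    window data[i - length : i] is paired with targets[i]."""
    return [(data[i - length:i], targets[i]) for i in range(length, len(data))]


memory = 3
x = list(range(10))  # synthetic series where input[t] == target[t]

# Unadjusted: neither helper pairs a window with the target at its last timestep.
print(tdfa_pairs(x, x, memory)[0])   # ([0, 1, 2], 0)
print(tsgen_pairs(x, x, memory)[0])  # ([0, 1, 2], 3)

# Adjusted: drop the first memory - 1 targets for the dataset-style pairing,
# or shift the inputs back by one step for the generator-style pairing.
aligned_tdfa = tdfa_pairs(x, x[memory - 1:], memory)
aligned_tsgen = tsgen_pairs(x[1:] + [None], x, memory)
print(aligned_tdfa[0])   # ([0, 1, 2], 2)
print(aligned_tsgen[0])  # ([1, 2, 3], 3)
```

In both adjusted cases the last element of every window equals its target, which is the input[t] <=> target[t] pairing described above. The trailing None in the shifted inputs is never read, because the generator-style windows stop one step before the end of the data.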

I'm using an LSTM, potentially with the stateful argument, so I need to ensure my batches are a perfect fit (no partial batches), and I also want the batch size to be a power of 2 (according to some posts regarding processor efficiency). I've tried implementing my own function to make this happen, given a few additional assumptions regarding validation/test sizes. On the surface everything looks good, but since I'm unsure of Keras' internals I don't know if I've made a blunder.
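The two batch constraints above can be sketched in plain Python. This is only an illustration: greatest_power_of_two and leading_rows_to_drop are hypothetical helper names, and the window count assumes stride 1 with one window per position once `memory` rows are available (the timeseries_dataset_from_array convention; TimeseriesGenerator yields one window fewer).

```python
def greatest_power_of_two(size):
    """Largest power of 2 that is less than or equal to size (size >= 1)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return 1 << (size.bit_length() - 1)


def leading_rows_to_drop(n_rows, memory, batch_size):
    """Number of rows to trim from the start of a dataset so that the
    remaining windows divide evenly into batches of batch_size."""
    n_windows = n_rows - memory + 1
    if n_windows < batch_size:
        raise ValueError("not enough rows for a single full batch")
    return n_windows % batch_size


print(greatest_power_of_two(25))         # 16
print(leading_rows_to_drop(100, 8, 16))  # 13
```

With 100 rows and memory=8 there are 93 windows; dropping the 13 oldest rows leaves 80 windows, or exactly five batches of 16.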

My question is whether I've properly aligned/batched the inputs and targets using timeseries_dataset_from_array/TimeseriesGenerator such that running model.fit() will train using losses/metrics that compare the target at time [t] with the predicted value at time [t] using inputs at time [t].

import pandas as pd
import numpy as np

use_ts_data = True
try:
    # Comment this line out if you want to test timeseries_dataset_from_array
    raise ImportError("No TDFA for you")
    from tensorflow.keras.preprocessing import timeseries_dataset_from_array as ts_data
except (ModuleNotFoundError, ImportError):
    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator as ts_gen

    use_ts_data = False

def gp2(size):
    # Greatest power of 2 that is less than or equal to size
    return np.power(2, int(np.log2(size)))

def train_validate_test_split(
    features, targets, train_size_ratio=0.5, max_batch_size=None, memory=1,
):
    def batch_size_with_buffer(buffer, available, desired, max_batch_size):
        batch_size = gp2(min(desired, max_batch_size or np.inf))
        if available < batch_size * 3 + buffer:
            # If we don't have enough records to support this batch_size, use 1 power lower.
            # Integer halving avoids the floating-point rounding of a log-based exponent.
            batch_size = batch_size // 2
        return int(batch_size)

    memory = max(1, memory)
    surplus = memory - 1
    test_size_ratio = 1 - train_size_ratio
    total_size = features.shape[0]
    smallest_size = int(total_size * test_size_ratio / 2)

    # Error on insufficient data
    def insufficient_data():
        raise RuntimeError(
            f"Insufficient data on which to split train/validation/test when ratio={train_size_ratio}, nobs={total_size} and memory={memory}"
        )

    if total_size < memory + 3:
        insufficient_data()

    # Find greatest batch size that is a power of 2, that fits the smallest dataset size, and is no greater than max_batch_size
    batch_size = batch_size_with_buffer(
        surplus, total_size, smallest_size, max_batch_size
    )
    test_size = smallest_size - smallest_size % batch_size

    # Create/align the datasets
    if use_ts_data:
        index_offset = None
        
        start = -test_size
        X_test = features.iloc[start - surplus:]
        y_test = targets.iloc[start:]

        end = start
        start = end - test_size
        X_validation = features.iloc[start - surplus:end]
        y_validation = targets.iloc[start:end]

        end = start
        start = (total_size + end - surplus) % batch_size
        X_train = features.iloc[start:end]
        y_train = targets.iloc[start + surplus:end]
    else:
        index_offset = memory
        _features = features.shift(-1)
        
        start = -test_size - memory
        X_test = _features.iloc[start:]
        y_test = targets.iloc[start:]

        end = start + memory
        start = end - test_size - memory
        X_validation = _features.iloc[start:end]
        y_validation = targets.iloc[start:end]

        end = start + memory
        start = (total_size + end - memory) % batch_size
        X_train = _features.iloc[start:end]
        y_train = targets.iloc[start:end]

    # Record indexes
    test_index = y_test.index[index_offset:]
    validation_index = y_validation.index[index_offset:]
    train_index = y_train.index[index_offset:]
    
    if memory > X_train.shape[0] or memory > X_validation.shape[0]:
        insufficient_data()

    format_data = ts_data if use_ts_data else ts_gen
    train = format_data(X_train.values, y_train.values, memory, batch_size=batch_size)
    validation = format_data(
        X_validation.values, y_validation.values, memory, batch_size=batch_size
    )
    test = format_data(X_test.values, y_test.values, memory, batch_size=batch_size)

    # Print out the first and last batch of each dataset for inspection
    def results(dataset, index):
        def show(label, inputs, targets):
            # Last timestep of the first and last window, next to their targets
            print(
                f"{label}:\n\nInputs:\n",
                inputs[0][-1],
                "...",
                inputs[-1][-1],
                "\n\nTargets:\n",
                targets[0],
                "...",
                targets[-1],
            )
            print(inputs.shape, targets.shape, "\n\n")

        print("\n-------------------\n")
        print("Index:\n\n", index, "\n\n")
        last_i = len(dataset) - 1
        for i, (inputs, targets) in enumerate(dataset):
            if i == 0:
                show("First", inputs, targets)
            if i == last_i:
                show("Last", inputs, targets)
        print("\n-------------------\n")

    results(train, train_index)
    results(validation, validation_index)
    results(test, test_index)

    return (
        batch_size,
        train,
        validation,
        test,
        train_index,
        validation_index,
        test_index,
    )

# inputs and targets are expected to be aligned (i.e., loss functions should subtract the predicted target@t from the actual target@t)
x = np.arange(101)
df = pd.DataFrame(index=x)
df['inputs'] = x
df['targets'] = x

(
    batch_size,
    train,
    validation,
    test,
    train_index,
    validation_index,
    test_index,
) = train_validate_test_split(
    df['inputs'], df['targets'], train_size_ratio=0.5, max_batch_size=2, memory=8
)
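Because the synthetic inputs and targets above are identical at every index, alignment can also be checked programmatically rather than by reading the printed batches. A minimal sketch, assuming each batch unpacks as an (inputs, targets) pair the way the results helper does; count_aligned_pairs is a hypothetical helper name, and the batches below are hand-built stand-ins rather than real Keras output.

```python
def count_aligned_pairs(batches):
    """Return the number of (window, target) pairs checked; raise if any
    window's last timestep differs from its target."""
    checked = 0
    for inputs, targets in batches:
        for window, target in zip(inputs, targets):
            if window[-1] != target:
                raise AssertionError(
                    f"pair {checked}: window ends at {window[-1]}, target is {target}"
                )
            checked += 1
    return checked


# Hand-built stand-in for two batches of size 2 with memory=3
fake_batches = [
    ([[0, 1, 2], [1, 2, 3]], [2, 3]),
    ([[2, 3, 4], [3, 4, 5]], [4, 5]),
]
print(count_aligned_pairs(fake_batches))  # 4
```

The same call on the train, validation, and test objects returned above would be the equivalent check. It is only meaningful for test data where input[t] == target[t].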

Solution

All loss/metric functions that rely on y_pred and y_true assume matching indices. Keras does nothing special in the background.
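To make that concrete, here is a plain-Python stand-in for mean squared error. It is not Keras's implementation, only an illustration of a position-wise comparison, and the sample values are made up for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Position-wise MSE: element k of y_true is compared only with element k of y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


predictions = [1.0, 2.0, 3.0, 4.0]      # a model that predicts target[t] perfectly
aligned_targets = [1.0, 2.0, 3.0, 4.0]  # target[t] stored at position t
shifted_targets = [2.0, 3.0, 4.0, 5.0]  # target[t+1] stored at position t by mistake

print(mean_squared_error(aligned_targets, predictions))  # 0.0
print(mean_squared_error(shifted_targets, predictions))  # 1.0
```

Nothing in the loss can detect or correct the shift, so any alignment has to be done when the batches are built.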

Answered By - SnakeWasTheNameTheyGaveMe

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
