Issue
I'm trying to preprocess a dataset for a neural network. To do that, I need to reshape an array with the shape (2040906, 1) into an array of batches.
I need a batch size of around 1440 rows, but 2040906 is obviously not evenly divisible by that number.
I tried computing the remainder of the division and dropping that many rows so that the row count divides evenly. But dropping rows from my dataset is not what I want to do.
Here is an example snippet to reproduce the problem:
import numpy as np
x = np.ones((2040906, 1))
np.split(x, 1440)  # raises ValueError: array split does not result in an equal division
The perfect solution for me would be some kind of function that returns the nearest divisor of a given value, i.e. one that leaves a remainder of 0.
Solution
Looking for the largest divisor is not a good approach, for two reasons.
- The size of the array might be a prime number.
- The divisor may be too large or too small, resulting in ineffective learning.
A better idea is to pad the dataset with samples randomly selected from the whole dataset, so that its size becomes divisible by the optimal batch size. Here is a simple trick to compute the size of the padded array that is divisible by 1440:
(-x.shape[0] % 1440) + x.shape[0]
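The padding idea could be sketched roughly as follows. The helper name pad_to_multiple, the seed parameter, and the choice to sample with replacement are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np

def pad_to_multiple(x, batch_size, seed=0):
    """Pad x along axis 0 with randomly chosen rows of x, then split into batches."""
    n_rows = x.shape[0]
    # Rows missing to reach the next multiple of batch_size (0 if already a multiple).
    n_missing = -n_rows % batch_size
    rng = np.random.default_rng(seed)
    # Assumption: padding rows are sampled with replacement from the existing rows.
    extra_idx = rng.integers(0, n_rows, size=n_missing)
    padded = np.concatenate([x, x[extra_idx]], axis=0)
    # Result has shape (n_batches, batch_size, ...remaining dims of x).
    return padded.reshape(-1, batch_size, *x.shape[1:])

x = np.ones((2040906, 1))
batches = pad_to_multiple(x, 1440)
print(batches.shape)  # (1418, 1440, 1)
```

Here 1014 rows are appended, which gives 2041920 rows, or 1418 batches of 1440.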
However, when the data is ordered (like a time series), padding cannot be used because there is no way to construct representative content for the padding data.
The alternative solution is to minimize the amount of truncated data. One can search through a range of acceptable batch sizes to find the one that requires minimal truncation.
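def find_best_divisor(size, low, high, step=1):
    # Candidate batch sizes are range(low, high, step); high itself is excluded.
    # min() on (remainder, divisor) tuples picks the smallest remainder,
    # preferring the smaller divisor on ties.
    minimal_truncation, best_divisor = min((size % divisor, divisor)
                                           for divisor in range(low, high, step))
    return best_divisor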
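As a usage illustration, here is a small variant of the same search that also reports how many rows would be dropped. The name best_batch_size and the search window 1400 to 1500 are assumptions chosen for this sketch, not values from the original answer.

```python
def best_batch_size(size, low, high, step=1):
    """Return (batch_size, dropped_rows) with the fewest dropped rows.

    Candidates are range(low, high, step), so `high` itself is excluded.
    Ties are broken in favour of the smaller batch size.
    """
    dropped, batch_size = min((size % d, d) for d in range(low, high, step))
    return batch_size, dropped

# Small case that is easy to check by hand:
# 100 % 6 = 4, 100 % 7 = 2, 100 % 8 = 4, 100 % 9 = 1
print(best_batch_size(100, 6, 10))  # (9, 1)

# The dataset size from the question, searching batch sizes near 1440.
size = 2040906
batch_size, dropped = best_batch_size(size, 1400, 1500)
n_batches = (size - dropped) // batch_size
print(batch_size, dropped, n_batches)
```

Since 1440 lies inside the search window, the number of dropped rows can never exceed the 426 rows that a fixed batch size of 1440 would discard.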
This approach is nice because it makes good use of the data while keeping a batch size suitable for training.
Answered By - tstanisl