Issue
I am trying to build a regression model using Keras. Because I have too much data to load in memory, I am using tf.data.experimental.make_csv_dataset to create a dataset object.
The data needs to be normalized, and I think I understand how to normalize the features, but I can't find a proper way to normalize the labels.
I have the following code so far, with my training data in CSV files in the training_data folder. In the CSV files, columns 'a' and 'b' are the features, and 'labels' is the label, all numeric.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers.experimental import preprocessing

# Create the dataset
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="training_data/*.csv",
    select_columns=['a', 'b', 'labels'],
    label_name='labels',
    batch_size=5, num_epochs=1,
    num_parallel_reads=20,
    shuffle_buffer_size=10000)

# Create a function for feature normalization
def get_normalization_layer(name, dataset):
    # Create a Normalization layer for our feature.
    normalizer = preprocessing.Normalization()
    # Prepare a Dataset that only yields our feature.
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the statistics of the data.
    normalizer.adapt(feature_ds)
    return normalizer

# Create a preprocessing layer for input
numerical_columns = []
for feature in ['a', 'b']:
    normalizer = get_normalization_layer(feature, dataset)
    num_col = tf.feature_column.numeric_column(feature, normalizer_fn=normalizer)
    numerical_columns.append(num_col)
preprocessing_layer = tf.keras.layers.DenseFeatures(numerical_columns)

# Create and compile the model
model = Sequential()
model.add(preprocessing_layer)
model.add(Dense(20, activation='relu'))
model.add(Dense(20, activation='relu'))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
So in short, how do I normalize labels in a PrefetchDataset?
Solution
You can use the from_generator API to normalize the labels; if you want, you can normalize your features with this approach as well. I am providing pseudo code, as I don't have your complete code with me, but you will get the gist of where I am going.
def gen():
    x = []
    y = []
    i = 0
    for element in dataset.as_numpy_iterator():
        # I am supposing that element[0] is x and element[1] is y.
        x.append(element[0])
        y.append(element[1])
        i += 1
        if i % BATCH_SIZE == 0:
            # normalize y and yield the collected batch
            yield x, normalization_function(y)
            x = []
            y = []

new_dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        # here what you expect to have as an output.
        # Look at this to have a better idea https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator
    )
)
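To make the sketch above concrete, the output_signature can be a pair of tf.TensorSpec objects matching what the generator yields. The following is a self-contained sketch, not your actual pipeline: it substitutes synthetic NumPy data for the CSV files, and the shapes (two features, one scalar label per row) and the placeholder normalization_function (standard-score with assumed precomputed statistics) are my assumptions.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 5

def normalization_function(y, mean=0.0, std=1.0):
    # Placeholder: standard-score normalization with assumed
    # precomputed global label statistics.
    return (np.asarray(y, dtype=np.float32) - mean) / std

def gen():
    # Synthetic stand-in for the CSV rows: 20 examples with
    # two features and one numeric label each.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(20, 2)).astype(np.float32)
    labels = rng.normal(size=(20,)).astype(np.float32)
    x, y = [], []
    for f, l in zip(feats, labels):
        x.append(f)
        y.append(l)
        if len(y) == BATCH_SIZE:
            # Yield one batch of features and normalized labels.
            yield np.stack(x), normalization_function(y)
            x, y = [], []

new_dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(BATCH_SIZE, 2), dtype=tf.float32),  # features
        tf.TensorSpec(shape=(BATCH_SIZE,), dtype=tf.float32),    # labels
    ),
)
```

Each element of new_dataset is then a ready-made (features, labels) batch that can be passed straight to model.fit.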
Feel free to ask questions.
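As an aside (this is not from the original answer), a simpler sketch is to adapt a Normalization layer on the label stream, mirroring what you already do for the features, and apply it with dataset.map. This assumes scalar numeric labels and tf.keras.layers.Normalization (found under tf.keras.layers.experimental.preprocessing in older TensorFlow versions); the synthetic in-memory dataset below stands in for your CSV pipeline.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the batched CSV dataset: (features_dict, labels).
features = {"a": np.arange(10, dtype=np.float32),
            "b": np.arange(10, dtype=np.float32)}
labels = np.arange(10, dtype=np.float32) * 2.0
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(5)

# Learn the label statistics, exactly as done for the features.
label_normalizer = tf.keras.layers.Normalization(axis=None)
label_normalizer.adapt(dataset.map(lambda x, y: y))

# Apply the adapted layer to the labels only.
normalized_dataset = dataset.map(lambda x, y: (x, label_normalizer(y)))
```

After the map, the labels stream out with roughly zero mean and unit variance, while the features pass through untouched.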
Answered By - pratsbhatt