Issue
I am trying to build a regression model using Keras. Because I have too much data to load in memory, I am using tf.data.experimental.make_csv_dataset to create a dataset object.
The data needs to be normalized, and I think I understand how to normalize the features, but I can't find a proper way to normalize the labels.
I have the following code so far, with my training data in CSV files in the training_data folder. In the CSV files, columns 'a' and 'b' are the features, and 'labels' is the label, all numeric.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers.experimental import preprocessing

# Create the dataset
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="training_data/*.csv",
    select_columns=['a', 'b', 'labels'],
    label_name='labels',
    batch_size=5, num_epochs=1,
    num_parallel_reads=20,
    shuffle_buffer_size=10000)

# Create a function for feature normalization
def get_normalization_layer(name, dataset):
    # Create a Normalization layer for our feature.
    normalizer = preprocessing.Normalization()
    # Prepare a Dataset that only yields our feature.
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the statistics of the data.
    normalizer.adapt(feature_ds)
    return normalizer

# Create a preprocessing layer for input
numerical_columns = []
for feature in ['a', 'b']:
    normalizer = get_normalization_layer(feature, dataset)
    num_col = tf.feature_column.numeric_column(feature, normalizer_fn=normalizer)
    numerical_columns.append(num_col)
preprocessing_layer = tf.keras.layers.DenseFeatures(numerical_columns)

# Create and compile the model
model = Sequential()
model.add(preprocessing_layer)
model.add(Dense(20, activation='relu'))
model.add(Dense(20, activation='relu'))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
So in short, how do I normalize labels in a PrefetchDataset?
Solution
You can use the from_generator API to normalize the labels; if you want, you can normalize your features with this approach as well. I am providing pseudo code, as I don't have your complete code with me, but you will get the gist of where I am going.
def gen():
    x = []
    y = []
    i = 0
    for element in dataset.as_numpy_iterator():
        # I am supposing that element[0] is x and element[1] is y.
        x.append(element[0])
        y.append(element[1])
        i += 1
        if i % BATCH_SIZE == 0:
            # normalize y and yield the collected batch
            yield x, normalization_function(y)
            x = []
            y = []

new_dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        # here what you expect to have as an output.
        # Look at this to have a better idea https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator
    )
)
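To make the sketch above concrete, the output_signature can be a pair of tf.TensorSpec objects matching what the generator yields. The following is a self-contained sketch, not your actual pipeline: it substitutes synthetic NumPy data for the CSV files, and the shapes (two features, one scalar label per row) and the placeholder normalization_function (standard-score with assumed precomputed statistics) are my assumptions.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 5

def normalization_function(y, mean=0.0, std=1.0):
    # Placeholder: standard-score normalization with assumed
    # precomputed global label statistics.
    return (np.asarray(y, dtype=np.float32) - mean) / std

def gen():
    # Synthetic stand-in for the CSV rows: 20 examples with
    # two features and one numeric label each.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(20, 2)).astype(np.float32)
    labels = rng.normal(size=(20,)).astype(np.float32)
    x, y = [], []
    for f, l in zip(feats, labels):
        x.append(f)
        y.append(l)
        if len(y) == BATCH_SIZE:
            # Yield one batch of features and normalized labels.
            yield np.stack(x), normalization_function(y)
            x, y = [], []

new_dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(BATCH_SIZE, 2), dtype=tf.float32),  # features
        tf.TensorSpec(shape=(BATCH_SIZE,), dtype=tf.float32),    # labels
    ),
)
```

Each element of new_dataset is then a ready-made (features, labels) batch that can be passed straight to model.fit.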
Feel free to ask questions.
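As an aside (this is not from the original answer), a simpler sketch is to adapt a Normalization layer on the label stream, mirroring what you already do for the features, and apply it with dataset.map. This assumes scalar numeric labels and tf.keras.layers.Normalization (found under tf.keras.layers.experimental.preprocessing in older TensorFlow versions); the synthetic in-memory dataset below stands in for your CSV pipeline.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the batched CSV dataset: (features_dict, labels).
features = {"a": np.arange(10, dtype=np.float32),
            "b": np.arange(10, dtype=np.float32)}
labels = np.arange(10, dtype=np.float32) * 2.0
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(5)

# Learn the label statistics, exactly as done for the features.
label_normalizer = tf.keras.layers.Normalization(axis=None)
label_normalizer.adapt(dataset.map(lambda x, y: y))

# Apply the adapted layer to the labels only.
normalized_dataset = dataset.map(lambda x, y: (x, label_normalizer(y)))
```

After the map, the labels stream out with roughly zero mean and unit variance, while the features pass through untouched.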
Answered By - pratsbhatt