Issue
I am attempting to fine-tune a BERT model from TensorFlow Hub on Google Colab, following this link.
However, when I call model.fit(...), I run into the following error:
InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2047) arg_shape.handle_type != DT_INVALID input edge: [id=2693 model_preprocessing_67660:0 -> cluster_train_function:628]
The error only occurs when I use the TPU. The same code runs fine on CPU, but training takes a very long time.
Here is my code for setting up the TPU and model:
TPU Setup:
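import os
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops the preprocessing model needs
from official.nlp import optimization  # provides create_optimizer, as in the TF Hub BERT tutorial

# Read TF Hub models uncompressed from remote storage, then connect to the Colab TPU
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.TPUStrategy(cluster_resolver)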
Model Setup:
def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer('https://tfhub.dev/google/experts/bert/wiki_books/sst2/2', trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    return tf.keras.Model(text_input, net)
Model Training:
with strategy.scope():
    bert_model = build_classifier_model()
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    metrics = tf.metrics.BinaryAccuracy()
    epochs = 1
    steps_per_epoch = 1280000
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(0.1*num_train_steps)
    init_lr = 3e-5
    optimizer = optimization.create_optimizer(init_lr=init_lr,
                                              num_train_steps=num_train_steps,
                                              num_warmup_steps=num_warmup_steps,
                                              optimizer_type='adamw')
    bert_model.compile(optimizer=optimizer,
                       loss=loss,
                       metrics=metrics)
    print('Training model')
    history = bert_model.fit(x=X_train, y=y_train,
                             validation_data=(X_val, y_val),
                             epochs=epochs)
Note that X_train is a NumPy array of type str with shape (1280000,), and y_train is a NumPy array of shape (1280000, 1).
Solution
I don't know exactly what changes you have made to the code, and I know nothing about your dataset. But I can see that you are training on the whole dataset for one epoch and hard-coding steps_per_epoch. I would recommend writing it like this:
# Set batch_size to a power of two (for example 16 or 32); if you don't want to batch the dataset, set batch_size to 1
batch_size = 16
steps_per_epoch = training_data_size // batch_size
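To make the arithmetic concrete, here is a minimal sketch using the numbers from the question (1,280,000 training examples, one epoch, 10% warmup) and the batch size of 16 suggested above. The helper name training_schedule is hypothetical and not part of TensorFlow. It only shows how the step counts follow from the dataset size, and it is not guaranteed to resolve the TPU error on its own.

```python
def training_schedule(training_data_size, batch_size, epochs=1):
    """Derive step counts from the dataset size instead of hard-coding them.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # One step consumes one batch, so an epoch takes size // batch_size steps
    # (a final partial batch is ignored here).
    steps_per_epoch = training_data_size // batch_size
    num_train_steps = steps_per_epoch * epochs
    # 10% warmup, as in the question; integer division avoids float rounding.
    num_warmup_steps = num_train_steps // 10
    return steps_per_epoch, num_train_steps, num_warmup_steps


steps_per_epoch, num_train_steps, num_warmup_steps = training_schedule(
    training_data_size=1_280_000, batch_size=16, epochs=1)
print(steps_per_epoch, num_train_steps, num_warmup_steps)  # 80000 80000 8000
```

With a batch size of 16, one epoch is 80,000 steps rather than the 1,280,000 set in the question, so the warmup and decay schedule passed to create_optimizer would be 16 times shorter.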
The problem with the code is most probably the training dataset size. I think you're making a mistake by entering the size of the training dataset manually.
If you're loading the dataset from tfds, use (as shown in the link):
train_dataset, train_data_size = load_dataset_from_tfds(
    in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
If you're using a custom dataset, store the size of the cleaned dataset in a variable and use that variable wherever the training data size is needed. Avoid hard-coding values in the code as far as possible.
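For a custom dataset, the idea might look like the sketch below. The cleaning rule (dropping empty or whitespace-only texts), the sample data, and the helper name clean_examples are all assumptions made for illustration. The point is only that the size is measured after cleaning rather than typed in by hand.

```python
def clean_examples(texts, labels):
    """Drop examples whose text is empty or whitespace-only.

    Hypothetical cleaning rule for illustration; real cleaning will differ.
    """
    kept = [(t.strip(), y) for t, y in zip(texts, labels) if t and t.strip()]
    clean_texts = [t for t, _ in kept]
    clean_labels = [y for _, y in kept]
    return clean_texts, clean_labels


# Made-up sample data standing in for the real training set.
raw_texts = ["great movie", "", "terrible plot", "   ", "loved it"]
raw_labels = [1, 0, 0, 1, 1]

X_train, y_train = clean_examples(raw_texts, raw_labels)

# Measure the size after cleaning rather than typing a number by hand.
training_data_size = len(X_train)
batch_size = 2
steps_per_epoch = training_data_size // batch_size
print(training_data_size, steps_per_epoch)  # 3 1
```

If the cleaning step later removes more or fewer rows, training_data_size and steps_per_epoch follow automatically.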
Answered By - Chinmay