I'm trying to create a Siamese model with Keras
that learns to recognize differences in Mel spectrograms.
The dataset I'm using is ESC-50.
I split it into training files (40 classes with 40 files each) and test files (5 classes with 40 files each).
I then generate positive and negative pairs.
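The post does not show `utiltiy_functions.generate_pairs`, so the following is only a minimal sketch of one common way to build such pairs: for every file, pick one partner from the same class (label 1) and one from a different class (label 0). The function name `generate_pairs_sketch` and the `seed` parameter are hypothetical and not taken from the original script.

```python
import random

def generate_pairs_sketch(labels, indices, seed=0):
    """Hypothetical pair builder: one positive and one negative pair per file.

    labels  -- class label for every file in the dataset
    indices -- the file indices belonging to one split (e.g. training)
    Returns (pairs, pair_labels), where 1 marks a same-class pair.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)

    pairs, pair_labels = [], []
    for idx in indices:
        own = labels[idx]
        same = [j for j in by_class[own] if j != idx]
        other = [j for j in indices if labels[j] != own]
        if not same or not other:
            continue  # cannot form both pair types for this file
        pairs.append((idx, rng.choice(same)))   # positive pair
        pair_labels.append(1)
        pairs.append((idx, rng.choice(other)))  # negative pair
        pair_labels.append(0)
    return pairs, pair_labels

toy_labels = [0, 0, 1, 1]
toy_pairs, toy_pair_labels = generate_pairs_sketch(toy_labels, [0, 1, 2, 3])
```

Under this scheme, 1600 training files would give 3200 pairs with alternating labels, which is at least consistent with the array shapes in the question, although the real helper may work differently.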
I compute Mel spectrograms with 64 Mel bands, so each spectrogram has the shape (64, 626).
For example, the arrays feat_train_1 and feat_train_2 have the shape (3200, 64, 626).
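The 626 time frames can be checked by hand. This is a small sketch that assumes the clips are 5 seconds long (the ESC-50 clip length), resampled to the 16 kHz used in the script, and that librosa pads the signal at both ends (its default `center=True`), which gives one frame per hop plus one.

```python
def n_frames(n_samples, hop_length):
    """Frame count of a centre-padded STFT: one frame per hop, plus one."""
    return 1 + n_samples // hop_length

sr = 16000          # sample rate passed to melspectrogram in the script
clip_seconds = 5    # assumption: ESC-50 clips are 5 s long
hop_length = 128    # hop length passed to melspectrogram in the script

frames = n_frames(sr * clip_seconds, hop_length)
print(frames)  # 626
```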
In this picture, the first two spectrograms are feat_train_1[i] and feat_train_2[i], with pair_labels_train[i]=1 (positive pair).
The third and fourth spectrograms are feat_train_1[i+1] and feat_train_2[i+1], with pair_labels_train[i+1]=0 (negative pair).
I then expand the feature arrays with a channel dimension and broadcast them to 3 channels.
I'm using the VGG16 network to extract embeddings from the features. The Euclidean distance between the two embeddings is then calculated.
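The helper `utiltiy_functions.eucl_distance` is not shown in the post. As a rough illustration of what such a distance computes for a single pair of embeddings, here is a plain-Python sketch; the name `euclidean_distance_sketch` is hypothetical, and the real helper presumably operates on batched tensors instead of lists.

```python
import math

def euclidean_distance_sketch(emb_a, emb_b):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    squared_sum = sum((a - b) ** 2 for a, b in zip(emb_a, emb_b))
    return math.sqrt(squared_sum)

# Identical embeddings are at distance 0; the further apart, the larger the value.
same = euclidean_distance_sketch([1.0, 2.0], [1.0, 2.0])
apart = euclidean_distance_sketch([0.0, 0.0], [3.0, 4.0])
```

The result is a single non-negative number per pair, which is why the Lambda layer in the model summary has the output shape (None, 1).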
The problem is that the accuracy (as well as val_accuracy) is stuck at around 50% while the loss slowly decreases. Here is the whole script:
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Input, Flatten, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import utiltiy_functions  # the author's own helper module (not shown)

# Load the audio files and split them
audio_data, labels = utiltiy_functions.read_audio_files('esc-50-master/audio_conv', 'esc-50-master/meta')
idx_training, idx_test, idx_eval = utiltiy_functions.split_data(audio_data, labels)
pair_idx_train, pair_labels_train = utiltiy_functions.generate_pairs(labels, idx_training)
pair_idx_test, pair_labels_test = utiltiy_functions.generate_pairs(labels, idx_test)
pair_idx_eval, pair_labels_eval = utiltiy_functions.generate_pairs(labels, idx_eval)
audio_data_train_1 = audio_data[pair_idx_train[:,0]]
audio_data_train_2 = audio_data[pair_idx_train[:,1]]
audio_data_test_1 = audio_data[pair_idx_test[:,0]]
audio_data_test_2 = audio_data[pair_idx_test[:,1]]
audio_data_eval_1 = audio_data[pair_idx_eval[:,0]]
audio_data_eval_2 = audio_data[pair_idx_eval[:,1]]
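# Calculate features and reshape
def get_librosa_melspecs(audio_array, name):
    # One (64, 626) log-Mel spectrogram per audio clip
    melspecs = np.zeros((audio_array.shape[0], 64, 626))
    for i, audio in enumerate(audio_array):
        mel = librosa.feature.melspectrogram(y=audio, n_mels=64, n_fft=1024, hop_length=128, sr=16000)
        # Take the log of the non-zero bins only, to avoid log(0)
        mel[mel != 0] = np.log(mel[mel != 0])
        #melnormalized = librosa.util.normalize(mellog)
        melspecs[i] = mel
    # Cache the features under the given file name
    # (this line is garbled in the post; np.save is the likely original call)
    np.save(name, melspecs)
    return melspecs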
feat_test_1 = get_librosa_melspecs(audio_data_test_1, "features_vgg_test1.npy")
feat_test_2 = get_librosa_melspecs(audio_data_test_2, "features_vgg_test2.npy")
feat_train_1 = get_librosa_melspecs(audio_data_train_1, "features_vgg_train1.npy")
feat_train_2 = get_librosa_melspecs(audio_data_train_2, "features_vgg_train2.npy")
feat_test_1 = np.expand_dims(feat_test_1, 3)
feat_test_2 = np.expand_dims(feat_test_2, 3)
feat_train_1 = np.expand_dims(feat_train_1, 3)
feat_train_2 = np.expand_dims(feat_train_2, 3)
feat_test_1 = tf.reshape(tf.broadcast_to(feat_test_1, (400,64,626,3)), (400,64,626,3))
feat_test_2 = tf.reshape(tf.broadcast_to(feat_test_2, (400,64,626,3)), (400,64,626,3))
feat_train_1 = tf.reshape(tf.broadcast_to(feat_train_1, (3200,64,626,3)), (3200,64,626,3))
feat_train_2 = tf.reshape(tf.broadcast_to(feat_train_2, (3200,64,626,3)), (3200,64,626,3))
#Build siamese net
feat_1 = Input(shape=(64,626,3))
feat_2 = Input(shape=(64,626,3))
model_vgg = VGG16(weights="imagenet", include_top=False, input_shape=(64,626,3))
for layer in model_vgg.layers:
    layer.trainable = True
pre_emb1 = model_vgg(feat_1)
pre_emb2 = model_vgg(feat_2)
#flatten and dense layers
flatten = Flatten()
dense_1 = Dense(4096, activation="relu")
dense_2 = Dense(4096, activation="relu")
dense_3 = Dense(512, activation="relu")
flatten1 = flatten(pre_emb1)
flatten2 = flatten(pre_emb2)
dense1_1 = dense_1(flatten1)
dense2_1 = dense_1(flatten2)
dense1_2 = dense_2(dense1_1)
dense2_2 = dense_2(dense2_1)
dense1_3 = dense_3(dense1_2)
dense2_3 = dense_3(dense2_2)
distance = Lambda(utiltiy_functions.eucl_distance)([dense1_3, dense2_3])
#Output Layer
outputs = Dense(1, activation="sigmoid")(distance)
#model definition
model = Model(inputs=[feat_1, feat_2], outputs=outputs)
opt = Adam(learning_rate=0.001)
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])
early_stopping = EarlyStopping(monitor='val_loss', patience=3, mode='auto', restore_best_weights=True)
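# Train the model
print("Training the Siamese model.\n")
model.fit(
    [feat_train_1[:], feat_train_2[:]], pair_labels_train[:],
    validation_data=([feat_test_1[:], feat_test_2[:]], pair_labels_test[:]),
    # The remaining arguments are cut off in the post. These values are inferred
    # from the log (160 steps for 3200 pairs, "Epoch 1/10") and the callback above.
    batch_size=20, epochs=10, callbacks=[early_stopping],
)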
2022-07-01 12:33:55.261913: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-07-01 12:33:55.262999: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Reading WAV files...
Splitting WAV files...
Building training pairs...
Building test pairs...
Building evaluation pairs...
Loading computed features from files...
2022-07-01 12:34:56.888225: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-07-01 12:34:56.895846: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cublas64_11.dll'; dlerror: cublas64_11.dll not found
2022-07-01 12:34:56.896817: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cublasLt64_11.dll'; dlerror: cublasLt64_11.dll not found
2022-07-01 12:34:56.897344: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
2022-07-01 12:34:56.898131: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2022-07-01 12:34:56.898859: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found
2022-07-01 12:34:56.900598: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cusparse64_11.dll'; dlerror: cusparse64_11.dll not found
2022-07-01 12:34:56.903538: W tensorflow/stream_executor/platform/default/] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2022-07-01 12:34:56.904142: W tensorflow/core/common_runtime/gpu/] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-07-01 12:34:56.940548: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-01 12:34:57.018051: W tensorflow/core/framework/] Allocation of 384614400 exceeds 10% of free system memory.
2022-07-01 12:34:57.230756: W tensorflow/core/framework/] Allocation of 384614400 exceeds 10% of free system memory.
2022-07-01 12:34:57.592132: W tensorflow/core/framework/] Allocation of 3076915200 exceeds 10% of free system memory.
2022-07-01 12:35:06.909754: W tensorflow/core/framework/] Allocation of 3076915200 exceeds 10% of free system memory.
Building Siamese network...
Model: "vgg16"
Layer (type)                Output Shape          Param #
input_3 (InputLayer)        [(None, 64, 626, 3)]  0
block1_conv1 (Conv2D)       (None, 64, 626, 64)   1792
block1_conv2 (Conv2D)       (None, 64, 626, 64)   36928
block1_pool (MaxPooling2D)  (None, 32, 313, 64)   0
block2_conv1 (Conv2D)       (None, 32, 313, 128)  73856
block2_conv2 (Conv2D)       (None, 32, 313, 128)  147584
block2_pool (MaxPooling2D)  (None, 16, 156, 128)  0
block3_conv1 (Conv2D)       (None, 16, 156, 256)  295168
block3_conv2 (Conv2D)       (None, 16, 156, 256)  590080
block3_conv3 (Conv2D)       (None, 16, 156, 256)  590080
block3_pool (MaxPooling2D)  (None, 8, 78, 256)    0
block4_conv1 (Conv2D)       (None, 8, 78, 512)    1180160
block4_conv2 (Conv2D)       (None, 8, 78, 512)    2359808
block4_conv3 (Conv2D)       (None, 8, 78, 512)    2359808
block4_pool (MaxPooling2D)  (None, 4, 39, 512)    0
block5_conv1 (Conv2D)       (None, 4, 39, 512)    2359808
block5_conv2 (Conv2D)       (None, 4, 39, 512)    2359808
block5_conv3 (Conv2D)       (None, 4, 39, 512)    2359808
block5_pool (MaxPooling2D)  (None, 2, 19, 512)    0
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
2022-07-01 12:35:22.586993: W tensorflow/core/framework/] Allocation of 318767104 exceeds 10% of free system memory.
Model: "model"
Layer (type)          Output Shape         Param #   Connected to
input_1 (InputLayer)  [(None, 64, 626, 3)  0         []
input_2 (InputLayer)  [(None, 64, 626, 3)  0         []
vgg16 (Functional)    (None, 2, 19, 512)   14714688  ['input_1[0][0]',
flatten (Flatten)     (None, 19456)        0         ['vgg16[0][0]',
dense (Dense)         (None, 4096)         79695872  ['flatten[0][0]',
dense_1 (Dense)       (None, 4096)         16781312  ['dense[0][0]',
dense_2 (Dense)       (None, 512)          2097664   ['dense_1[0][0]',
lambda (Lambda)       (None, 1)            0         ['dense_2[0][0]',
dense_3 (Dense)       (None, 1)            2         ['lambda[0][0]']
Total params: 113,289,538
Trainable params: 113,289,538
Non-trainable params: 0
Training Siamese network...
Training the Siamese model.
Epoch 1/10
1/160 [..............................] - ETA: 1:43:33 - loss: 1.6736 - accuracy: 0.5000
156/160 [============================>.] - ETA: 2:02 - loss: 0.7549 - accuracy: 0.4622
157/160 [============================>.] - ETA: 1:31 - loss: 0.7550 - accuracy: 0.4608
158/160 [============================>.] - ETA: 1:01 - loss: 0.7546 - accuracy: 0.4604
159/160 [============================>.] - ETA: 30s - loss: 0.7542 - accuracy: 0.4613
160/160 [==============================] - ETA: 0s - loss: 0.7538 - accuracy: 0.4619
160/160 [==============================] - 5059s 32s/step - loss: 0.7538 - accuracy: 0.4619 - val_loss: 0.7172 - val_accuracy: 0.4725
Epoch 2/10
1/160 [..............................] - ETA: 1:20:48 - loss: 0.7224 - accuracy: 0.4500
2/160 [..............................] - ETA: 1:19:53 - loss: 0.7171 - accuracy: 0.4500
3/160 [..............................] - ETA: 1:19:26 - loss: 0.7145 - accuracy: 0.4500
4/160 [..............................] - ETA: 1:19:04 - loss: 0.7090 - accuracy: 0.4875
5/160 [..............................] - ETA: 1:18:32 - loss: 0.7086 - accuracy: 0.4600
6/160 [>.............................] - ETA: 1:18:34 - loss: 0.7055 - accuracy: 0.4750
155/160 [============================>.] - ETA: 2:33 - loss: 0.7006 - accuracy: 0.4677
156/160 [============================>.] - ETA: 2:02 - loss: 0.7005 - accuracy: 0.4683
157/160 [============================>.] - ETA: 1:32 - loss: 0.7005 - accuracy: 0.4688
158/160 [============================>.] - ETA: 1:01 - loss: 0.7004 - accuracy: 0.4690
159/160 [============================>.] - ETA: 30s - loss: 0.7004 - accuracy: 0.4682
160/160 [==============================] - ETA: 0s - loss: 0.7003 - accuracy: 0.4694
160/160 [==============================] - 5075s 32s/step - loss: 0.7003 - accuracy: 0.4694 - val_loss: 0.6932 - val_accuracy: 0.5000
Epoch 3/10
1/160 [..............................] - ETA: 1:21:04 - loss: 0.6919 - accuracy: 0.5500
2/160 [..............................] - ETA: 1:20:37 - loss: 0.6933 - accuracy: 0.5000
3/160 [..............................] - ETA: 1:19:54 - loss: 0.6932 - accuracy: 0.5000
4/160 [..............................] - ETA: 1:19:19 - loss: 0.6940 - accuracy: 0.4750
5/160 [..............................] - ETA: 1:18:47 - loss: 0.6935 - accuracy: 0.4900
6/160 [>.............................] - ETA: 1:18:13 - loss: 0.6932 - accuracy: 0.5000
62/160 [==========>...................] - ETA: 49:52 - loss: 0.6936 - accuracy: 0.4839
63/160 [==========>...................] - ETA: 49:21 - loss: 0.6937 - accuracy: 0.4825
64/160 [===========>..................] - ETA: 48:50 - loss: 0.6936 - accuracy: 0.4844
65/160 [===========>..................] - ETA: 48:20 - loss: 0.6936 - accuracy: 0.4854
66/160 [===========>..................] - ETA: 47:49 - loss: 0.6936 - accuracy: 0.4848
67/160 [===========>..................] - ETA: 47:18 - loss: 0.6936 - accuracy: 0.4836
68/160 [===========>..................] - ETA: 46:48 - loss: 0.6936 - accuracy: 0.4838
69/160 [===========>..................] - ETA: 46:17 - loss: 0.6936 - accuracy: 0.4855
The accuracy won't change, and the loss decreases only slowly. I've tried training it for several hours.
I've already tried different losses, such as contrastive loss, and different networks, such as MobileNet or VGGish.
It's always stuck at 50%.
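One thing worth noting about the numbers in the log: a loss that settles near 0.693 is about ln 2, which is exactly what binary cross-entropy gives when the model outputs 0.5 for every pair. The small check below (plain Python, not part of the original script) illustrates this. It suggests, but does not prove, that the network has collapsed to a constant output.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over paired lists of labels and predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

pair_labels = [1, 0, 1, 0]                   # balanced positive/negative pairs
constant_guess = [0.5] * len(pair_labels)    # model that always predicts 0.5
loss = binary_crossentropy(pair_labels, constant_guess)
print(round(loss, 4))  # 0.6931
```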
I hope you can help me. Since this is my first post here feel free to ask more questions.
I was able to fix this by changing the last activation function from sigmoid to relu:
#Output Layer
outputs = Dense(1, activation="relu")(distance)
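The answer does not explain why this change helps. To make the difference concrete, here is a small sketch of what a `Dense(1)` layer computes on a scalar distance: a single weight and bias (the 2 parameters listed for the last layer in the model summary) followed by the activation. The values `w = -1.0` and `b = 1.0` are purely illustrative and not taken from a trained model.

```python
import math

def output_unit(distance, w, b, activation):
    """Single Dense(1) unit applied to a scalar distance."""
    z = w * distance + b
    if activation == "sigmoid":
        return 1.0 / (1.0 + math.exp(-z))
    if activation == "relu":
        return max(0.0, z)
    raise ValueError("unknown activation: " + activation)

w, b = -1.0, 1.0  # hypothetical weights: small distance -> high score
for d in (0.0, 1.0, 2.0):
    print(d, output_unit(d, w, b, "sigmoid"), output_unit(d, w, b, "relu"))
```

With these example weights, the ReLU output falls linearly from 1 at distance 0 to 0 at distance 1 and beyond, while the sigmoid output changes more gradually. Keep in mind that a ReLU output is not bounded to [0, 1], so it is not a probability in the usual sense.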
Answered By - logame