Issue
I am following the PyTorch tutorial on speech command recognition and trying to implement my own recognition of 22 sentences in German. In the tutorial, the audio tensors are padded, but the labels are only batched with torch.stack. Because of that, I get an error as soon as I start training the network:
RuntimeError: stack expects each tensor to be equal size, but got [456] at entry 0 and [470] at entry 1
I understand what this says, but since I am new to PyTorch I unfortunately can't implement a padding function for the sentences from scratch. I would therefore be happy about any hints and tips.
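For reference, here is a minimal sketch of what I think is happening (the lengths are made up to match the error message):

import torch

# two label tensors of different lengths, like my encoded sentences
a = torch.zeros(456, dtype=torch.long)
b = torch.zeros(470, dtype=torch.long)

torch.stack([a, b])  # RuntimeError: stack expects each tensor to be equal size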
Here is the code for the collate_fn and pad_sequence functions:
import torch

def pad_sequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)
def collate_fn(batch):
    # A data tuple has the form:
    # waveform, label
    tensors, targets = [], []

    # Gather in lists, and encode labels as indices
    for waveform, label in batch:
        tensors += [waveform]
        targets += [label]

    # Group the list of tensors into a batched tensor
    tensors = pad_sequence(tensors)
    targets = torch.stack(targets)
    return tensors, targets
Solution
Once I started working directly with pad_sequence, I understood how simple it is. In my case, all I needed was to pass the batch of encoded label tensors to it; PyTorch automatically pads each one with zeros up to the length of the longest sequence in the batch.
My code now looks like this:
def pad_AudioSequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)

def pad_TextSequence(batch):
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)
def collate_fn(batch):
    # A data tuple has the form:
    # waveform, label
    tensors, targets = [], []

    # Gather in lists, and encode labels as indices
    for waveform, label in batch:
        tensors += [waveform]
        targets += [label]

    # Group the list of tensors into a batched tensor
    tensors = pad_AudioSequence(tensors)
    targets = pad_TextSequence(targets)
    return tensors, targets
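For completeness, this is roughly how the collate function is passed to the DataLoader, the same way as in the tutorial (my_dataset and the batch size are placeholders; the dataset yields (waveform, encoded label) pairs):

from torch.utils.data import DataLoader

# my_dataset is a placeholder for a Dataset that yields (waveform, encoded_label) pairs
train_loader = DataLoader(
    my_dataset,
    batch_size=8,
    shuffle=True,
    collate_fn=collate_fn,  # pads both the audio and the label tensors per batch
)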
For those who still don't understand how it works, here is a little example:
from torch.nn.utils.rnn import pad_sequence

encDecClass2 = dummyEncoderDecoder()
sent1 = audioWorkerClass.sentences[4] # wie viel Prozent hat der Akku noch?
sent2 = audioWorkerClass.sentences[5] # Wie spät ist es?
sent3 = audioWorkerClass.sentences[6] # Mach einen Timer für 5 Sekunden.
# encode sentences into tensor of numbers, representing words, using my own enc-dec class
sent1 = encDecClass2.encode(sent1) # tensor([11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94])
sent2 = encDecClass2.encode(sent2) # tensor([27, 94, 28, 94, 12, 94, 29, 94, 15, 94])
sent3 = encDecClass2.encode(sent3) # tensor([30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94])
print(sent1.shape) # torch.Size([16])
print(sent2.shape) # torch.Size([10])
print(sent3.shape) # torch.Size([14])
batch = []
# add sentences to the batch as separate arrays
batch +=[sent1]
batch +=[sent2]
batch +=[sent3]
output = pad_sequence(batch,batch_first=True, padding_value=0)
print(f"{output}\n{output.shape}")
#############################################################################
# output:
# tensor([[11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94],
# [27, 94, 28, 94, 12, 94, 29, 94, 15, 94, 0, 0, 0, 0, 0, 0],
# [30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94, 0, 0]])
# torch.Size([3, 16])
#############################################################################
As you can see, all arrays were padded with zeros to the length of the longest one in the batch. The output shape is 3x16 because the batch contains three sentences and the longest sequence has 16 elements.
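dummyEncoderDecoder and audioWorkerClass above are my own helper classes. A simplified sketch of such an encoder could look like the following (it maps each word to an index and reserves 0 for padding; my real class also inserts an extra separator token, 94, after every word, which this sketch omits):

import torch

class DummyEncoderDecoder:
    # Minimal word-level encoder: maps every word to an integer index.
    # Index 0 is reserved for padding, so padded positions stay distinguishable.
    def __init__(self, sentences):
        words = sorted({w for s in sentences for w in s.split()})
        self.word2idx = {w: i + 1 for i, w in enumerate(words)}
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def encode(self, sentence):
        return torch.tensor([self.word2idx[w] for w in sentence.split()], dtype=torch.long)

    def decode(self, indices):
        return " ".join(self.idx2word[int(i)] for i in indices if int(i) != 0)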
Answered By - Bogdan Khamelyuk