Issue
I am following the PyTorch tutorial on speech command recognition and trying to implement my own recognition of 22 sentences in German. In the tutorial, the audio tensors are padded, but the labels are only batched with torch.stack. Because of that, I get an error as soon as I start training the network:
RuntimeError: stack expects each tensor to be equal size, but got [456] at entry 0 and [470] at entry 1
I understand what this says, but since I am new to PyTorch I unfortunately can't implement a padding function for the sentences from scratch. I would therefore be happy about any hints and tips.
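For reference, here is a minimal sketch of what I think is happening (the lengths are made up to match the error message):

import torch

# two label tensors of different lengths, like my encoded sentences
a = torch.zeros(456, dtype=torch.long)
b = torch.zeros(470, dtype=torch.long)

torch.stack([a, b])  # RuntimeError: stack expects each tensor to be equal size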
Here is the code for the collate_fn and pad_sequence functions:
import torch

def pad_sequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)
def collate_fn(batch):
    # A data tuple has the form:
    # waveform, label
    tensors, targets = [], []

    # Gather in lists, and encode labels as indices
    for waveform, label in batch:
        tensors += [waveform]
        targets += [label]

    # Group the list of tensors into a batched tensor
    tensors = pad_sequence(tensors)
    targets = torch.stack(targets)
    return tensors, targets
Solution
Once I started working directly with pad_sequence, I understood how simple it is. In my case, all I needed was to pass the batch of encoded label tensors to it; PyTorch automatically pads each one with zeros up to the length of the longest sequence in the batch.
My code now looks like this:
def pad_AudioSequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)

def pad_TextSequence(batch):
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)
def collate_fn(batch):
    # A data tuple has the form:
    # waveform, label
    tensors, targets = [], []

    # Gather in lists, and encode labels as indices
    for waveform, label in batch:
        tensors += [waveform]
        targets += [label]

    # Group the list of tensors into a batched tensor
    tensors = pad_AudioSequence(tensors)
    targets = pad_TextSequence(targets)
    return tensors, targets
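For completeness, this is roughly how the collate function is passed to the DataLoader, the same way as in the tutorial (my_dataset and the batch size are placeholders; the dataset yields (waveform, encoded label) pairs):

from torch.utils.data import DataLoader

# my_dataset is a placeholder for a Dataset that yields (waveform, encoded_label) pairs
train_loader = DataLoader(
    my_dataset,
    batch_size=8,
    shuffle=True,
    collate_fn=collate_fn,  # pads both the audio and the label tensors per batch
)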
For those who still don't understand how it works, here is a little example:
from torch.nn.utils.rnn import pad_sequence

encDecClass2 = dummyEncoderDecoder()
sent1 = audioWorkerClass.sentences[4] # wie viel Prozent hat der Akku noch?
sent2 = audioWorkerClass.sentences[5] # Wie spät ist es?
sent3 = audioWorkerClass.sentences[6] # Mach einen Timer für 5 Sekunden.
# encode sentences into tensor of numbers, representing words, using my own enc-dec class
sent1 = encDecClass2.encode(sent1) # tensor([11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94])
sent2 = encDecClass2.encode(sent2) # tensor([27, 94, 28, 94, 12, 94, 29, 94, 15, 94])
sent3 = encDecClass2.encode(sent3) # tensor([30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94])
print(sent1.shape) # torch.Size([16])
print(sent2.shape) # torch.Size([10])
print(sent3.shape) # torch.Size([14])
batch = []
# add sentences to the batch as separate arrays
batch +=[sent1]
batch +=[sent2]
batch +=[sent3]
output = pad_sequence(batch,batch_first=True, padding_value=0)
print(f"{output}\n{output.shape}")
#############################################################################
# output:
# tensor([[11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94],
# [27, 94, 28, 94, 12, 94, 29, 94, 15, 94, 0, 0, 0, 0, 0, 0],
# [30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94, 0, 0]])
# torch.Size([3, 16])
#############################################################################
As you can see, all arrays were padded with zeros to the length of the longest one in the batch. The output shape is 3x16 because the batch contains three sentences and the longest sequence has 16 elements.
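dummyEncoderDecoder and audioWorkerClass above are my own helper classes. A simplified sketch of such an encoder could look like the following (it maps each word to an index and reserves 0 for padding; my real class also inserts an extra separator token, 94, after every word, which this sketch omits):

import torch

class DummyEncoderDecoder:
    # Minimal word-level encoder: maps every word to an integer index.
    # Index 0 is reserved for padding, so padded positions stay distinguishable.
    def __init__(self, sentences):
        words = sorted({w for s in sentences for w in s.split()})
        self.word2idx = {w: i + 1 for i, w in enumerate(words)}
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def encode(self, sentence):
        return torch.tensor([self.word2idx[w] for w in sentence.split()], dtype=torch.long)

    def decode(self, indices):
        return " ".join(self.idx2word[int(i)] for i in indices if int(i) != 0)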
Answered By - Bogdan Khamelyuk