Issue
I am creating a captcha image recognition system. It first extracts features from the images with ResNet and then uses an LSTM to recognize the words and letters in the image. An fc layer is supposed to connect the two. I have not designed an LSTM model before and am very new to machine learning, so I am pretty confused and overwhelmed by this.
I am confused enough that I am not even totally sure what questions I should ask. But here are a couple things that stand out to me:
- What is the purpose of embedding the captions if the captcha images are all randomized?
- Is the linear fc layer in the first part of the for loop the correct way to connect the CNN feature vectors to the LSTM?
- Is this a correct use of the LSTM cell in the LSTM?
And, in general, if there are any suggestions of general directions to look into, that would be really appreciated.
So far, I have:
class LSTM(nn.Module):
    def __init__(self, cnn_dim, hidden_size, vocab_size, num_layers=1):
        super(LSTM, self).__init__()
        self.cnn_dim = cnn_dim #i think this is the input size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size #i think this should be the output size
        # Building your LSTM cell
        self.lstm_cell = nn.LSTMCell(input_size=self.vocab_size, hidden_size=hidden_size)
        '''Connect CNN model to LSTM model'''
        # output fully connected layer
        # CNN does not necessarily need the FCC layers, in this example it is just extracting the features, that gets set to the LSTM which does the actual processing of the features
        self.fc_in = nn.Linear(cnn_dim, vocab_size) #this takes the input from the CNN takes the features from the cnn #cnn_dim = 512, hidden_size = 128
        self.fc_out = nn.Linear(hidden_size, vocab_size) # this is the looper in the LSTM #I think this is correct?
        # embedding layer
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)
        # activations
        self.softmax = nn.Softmax(dim=1)

    def forward(self, features, captions):
        #features: extracted features from ResNet
        #captions: label of images
        batch_size = features.size(0)
        cnn_dim = features.size(1)
        hidden_state = torch.zeros((batch_size, self.hidden_size)).cuda() # Initialize hidden state with zeros
        cell_state = torch.zeros((batch_size, self.hidden_size)).cuda() # Initialize cell state with zeros
        outputs = torch.empty((batch_size, captions.size(1), self.vocab_size)).cuda()
        captions_embed = self.embed(captions)
        '''Design LSTM model for captcha image recognition'''
        # Pass the caption word by word for each time step
        # It receives an input(x), makes an output(y), and receives this output as an input again recurrently
        '''Defined hidden state, cell state, outputs, embedded captions'''
        # can be designed to be word by word or character by character
        for t in range(captions).size(1):
            # for the first time step the input is the feature vector
            if t == 0:
                # probably have to get the output from the ResNet layer
                # use the LSTM cells in here i presume
                x = self.fc_in(features)
                hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                outputs.append(hidden_state)
            # for the 2nd+ time steps
            else:
                hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                outputs.append(hidden_state)
        # build the output tensor
        outputs = torch.stack(outputs,dim=0)
        return outputs
Solution
- nn.Embedding() is usually used to map a sparse one-hot vector to a dense vector (e.g. map 'a' to [0.1, 0.2, ...]) so that it is practical to compute with. I do not understand why you embed captions, which look like the ground truth. If you want to compute a loss with them, try nn.CTCLoss().
- If you are going to feed a string to the LSTM, it is recommended to first embed its characters with nn.Embedding(), which makes them dense and practical to compute with. But if the input to the LSTM is something extracted by a CNN (or another module), it is already dense, so in my view it is not necessary to project it with fc_in.
- I often use nn.LSTM() instead of nn.LSTMCell(), because the latter is more troublesome to work with.
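To make the three points above more concrete, here is a minimal sketch rather than a definitive design. It assumes the CNN output has been reshaped into a sequence of per-step feature vectors, and all sizes and variable names (char_ids, features, seq_len and so on) are hypothetical, chosen only for illustration. It shows nn.Embedding() turning character indices into dense vectors, and nn.LSTM() consuming already-dense CNN features directly, with no manual time-step loop.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
vocab_size, embed_dim = 10, 4
batch_size, seq_len, cnn_dim, hidden_size = 2, 5, 16, 32

# nn.Embedding maps integer character indices to dense vectors.
embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embed_dim)
char_ids = torch.tensor([[1, 3, 5], [2, 2, 9]])  # (batch, chars)
dense_chars = embed(char_ids)                    # (batch, chars, embed_dim)

# CNN features are already dense, so they can be fed to the LSTM as they are.
# Assumption: the CNN output was reshaped into seq_len feature vectors per image.
features = torch.randn(batch_size, seq_len, cnn_dim)

# nn.LSTM runs over the whole sequence and manages the hidden and cell state itself.
lstm = nn.LSTM(input_size=cnn_dim, hidden_size=hidden_size, batch_first=True)
fc_out = nn.Linear(hidden_size, vocab_size)

hidden_seq, (h_n, c_n) = lstm(features)  # hidden_seq: (batch, seq_len, hidden_size)
logits = fc_out(hidden_seq)              # (batch, seq_len, vocab_size)

print(dense_chars.shape, logits.shape)
```

With batch_first=True the LSTM expects input of shape (batch, seq_len, input_size); without it, the expected layout is (seq_len, batch, input_size).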
There are some bugs in your code and I fixed them:
import torch
from torch import nn


class LSTM(nn.Module):
    def __init__(self, cnn_dim, hidden_size, vocab_size, num_layers=1):
        super(LSTM, self).__init__()
        self.cnn_dim = cnn_dim  # i think this is the input size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size  # i think this should be the output size
        # Building your LSTM cell
        self.lstm_cell = nn.LSTMCell(input_size=self.vocab_size, hidden_size=hidden_size)
        '''Connect CNN model to LSTM model'''
        # output fully connected layer
        # CNN does not necessarily need the FCC layers, in this example it is just extracting the features, that gets set to the LSTM which does the actual processing of the features
        self.fc_in = nn.Linear(cnn_dim, vocab_size)  # this takes the input from the CNN takes the features from the cnn #cnn_dim = 512, hidden_size = 128
        self.fc_out = nn.Linear(hidden_size, vocab_size)  # this is the looper in the LSTM #I think this is correct?
        # embedding layer
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)
        # activations
        self.softmax = nn.Softmax(dim=1)

    def forward(self, features, captions):
        # features: extracted features from ResNet
        # captions: label of images
        batch_size = features.size(0)
        cnn_dim = features.size(1)
        hidden_state = torch.zeros((batch_size, self.hidden_size)).cuda()  # Initialize hidden state with zeros
        cell_state = torch.zeros((batch_size, self.hidden_size)).cuda()  # Initialize cell state with zeros
        # outputs = torch.empty((batch_size, captions.size(1), self.vocab_size)).cuda()
        outputs = torch.Tensor([]).cuda()
        captions_embed = self.embed(captions)
        '''Design LSTM model for captcha image recognition'''
        # Pass the caption word by word for each time step
        # It receives an input(x), makes an output(y), and receives this output as an input again recurrently
        '''Defined hidden state, cell state, outputs, embedded captions'''
        # can be designed to be word by word or character by character
        # for t in range(captions).size(1):
        for t in range(captions.size(1)):
            # for the first time step the input is the feature vector
            if t == 0:
                # probably have to get the output from the ResNet layer
                # use the LSTM cells in here i presume
                x = self.fc_in(features)
                # hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                hidden_state, cell_state = self.lstm_cell(x, (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                # outputs.append(hidden_state)
                outputs = torch.cat([outputs, hidden_state])
            # for the 2nd+ time steps
            else:
                # hidden_state, cell_state = self.lstm_cell(x[t], (hidden_state, cell_state))
                hidden_state, cell_state = self.lstm_cell(x, (hidden_state, cell_state))
                x = self.fc_out(hidden_state)
                # outputs.append(hidden_state)
                outputs = torch.cat([outputs, hidden_state])
        # build the output tensor
        # outputs = torch.stack(outputs, dim=0)
        return outputs


m = LSTM(16, 32, 10)
m = m.cuda()
features = torch.randn((2, 16))
features = features.cuda()
captions = torch.randn((2, 10))
captions = torch.clip(captions, 0, 9)
captions = captions.long()
captions = captions.cuda()
m(features, captions)
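As for the nn.CTCLoss() suggestion, the following is a rough sketch of how such a loss could be computed and how per-step predictions could be decoded. It is not part of the code above. The sizes, the random logits tensor standing in for the model output, and the greedy_decode helper are all assumptions made for illustration. Class index 0 is reserved for the CTC blank here, so real character labels start at 1.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes: T time steps, N images, C classes (index 0 = CTC blank).
T, N, C, target_len = 12, 2, 11, 5

# Stand-in for the per-step class scores a model would produce, shape (T, N, C).
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)  # nn.CTCLoss expects log-probabilities

# Ground-truth character indices; 0 is the blank, so labels are drawn from 1..C-1.
targets = torch.randint(low=1, high=C, size=(N, target_len))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), target_len, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back to the logits


def greedy_decode(log_probs, blank=0):
    """Best-path decoding: argmax per step, collapse repeats, drop blanks."""
    best = log_probs.argmax(dim=2).transpose(0, 1)  # (N, T)
    decoded = []
    for seq in best.tolist():
        chars, prev = [], blank
        for idx in seq:
            if idx != blank and idx != prev:
                chars.append(idx)
            prev = idx
        decoded.append(chars)
    return decoded


predictions = greedy_decode(log_probs.detach())
print(loss.item(), predictions)
```

Because CTC aligns the T per-step outputs with a shorter label sequence on its own, the captions are only needed for the loss and never have to be embedded or fed into the LSTM.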
This paper may help you somewhat: https://arxiv.org/abs/1904.01906
Answered By - Depressant