Issue
I'm working through this fantastic book, rewriting the examples in PyTorch so that I retain the material better. My results have been comparable to the book's Keras code for most examples, but I'm having some trouble with this exercise. For those who have the book, this is on page 106.
The network used in the book to classify the text is as follows:
Book Code (Keras)
keras_model = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(46, activation='softmax'),
])
keras_model.compile(
    optimizer='rmsprop',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
hist = keras_model.fit(
    partial_train_xs,
    partial_train_ys,
    epochs=20,
    batch_size=512,
    validation_data=(val_xs, val_ys)
)
My attempt at recreating the same in PyTorch:
model = nn.Sequential(
    nn.Linear(10_000, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 46),
    nn.Softmax()
)
def compute_val_loss(model, xs, ys):
    preds = model(xs)
    return F.cross_entropy(preds, ys).item()

def compute_accuracy(model, xs, ys):
    preds = model(xs)
    acc = (preds.argmax(dim=1) == ys).sum() / len(preds)
    return acc.item()

def train_loop(model, xs, ys, epochs=20, lr=1e-3, opt=torch.optim.RMSprop,
               batch_size=512, loss_func=F.cross_entropy):
    opt = opt(model.parameters(), lr=lr)
    losses = []
    for i in range(epochs):
        epoch_loss = []
        for b in range(0, len(xs), batch_size):
            xbatch = xs[b:b+batch_size]
            ybatch = ys[b:b+batch_size]
            logits = model(xbatch)
            loss = loss_func(logits, ybatch)
            model.zero_grad()
            loss.backward()
            opt.step()
            epoch_loss.append(loss.item())
        losses.append([i, sum(epoch_loss)/len(epoch_loss)])
        print(loss.item())
    return losses
I've excluded the data loading portion for brevity, but it's just "multi-hot" encoding the word sequences: if the vocabulary is 10k words, each input is a 10k-dimensional vector with a 1 at every index corresponding to a word that appears in that sequence.
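For reference, a minimal sketch of that encoding, assuming the raw data is a list of integer word-index sequences as in the book's Reuters example (multi_hot and train_sequences are illustrative names, not the book's):

import torch

def multi_hot(sequences, num_words=10_000):
    # One row per sequence, one column per vocabulary word.
    out = torch.zeros(len(sequences), num_words)
    for i, seq in enumerate(sequences):
        out[i, seq] = 1.0  # put a 1 at every word index that appears in the sequence
    return out

# e.g. partial_train_xs = multi_hot(train_sequences)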
My Question:
The problem I'm running into is a substantial divergence in results between the book's Keras version (which behaves as expected) and the PyTorch version. After 20 epochs, the Keras version has negligible training loss and is about 80% accurate on validation. The Torch version, however, barely moves on the training loss: it starts at about 3.4 and ends after 20 epochs at about 3.1. The Keras version had a lower training loss than that after a single epoch (2.6).
The Torch version did make progress on accuracy, though it still lagged the Keras version, and there was a strange stair-step pattern in the accuracy curve.
What am I doing wrong in the Torch version? Or is there a legitimate, expected reason for the divergence? There are minor differences between the two libraries' default RMSprop arguments, but I fiddled with those and didn't see much of a difference, and the learning rate is the same for both. Even if I run the Torch version for 150 epochs, the train/test loss continues to go down (very slowly), but validation accuracy peaks at around 75%.
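For what it's worth, if you want to line the two optimizers up more closely: Keras's RMSprop defaults to rho=0.9 and epsilon=1e-7, while PyTorch's defaults to alpha=0.99 and eps=1e-8, so a rough, purely illustrative match would be:

opt = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,    # Keras default learning_rate=0.001
    alpha=0.9,  # corresponds to Keras rho=0.9 (PyTorch default is 0.99)
    eps=1e-7,   # corresponds to Keras epsilon=1e-7 (PyTorch default is 1e-8)
)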
Solution
After some more research (and sleep), I discovered that in PyTorch the cross-entropy loss expects raw logits, whereas Keras's sparse_categorical_crossentropy expects probabilities by default. You can set from_logits=True in the Keras version to make the two equivalent.
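As a sketch of what that logits-everywhere setup looks like on the Keras side (if you go this route, the final Dense layer drops its softmax so the model outputs raw logits):

keras_model = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(46),  # no softmax: outputs raw logits
])
keras_model.compile(
    optimizer='rmsprop',
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)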
After removing the Softmax layer in the PyTorch version (so the model outputs raw logits, which F.cross_entropy normalizes internally), I am getting roughly equivalent results.
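Concretely, the fixed PyTorch model is just the original Sequential without the final Softmax; everything else in the training loop above stays the same:

model = nn.Sequential(
    nn.Linear(10_000, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 46),  # raw logits: F.cross_entropy applies log-softmax internally
)
# compute_accuracy is unaffected: argmax over logits equals argmax over softmax probabilities.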
Answered By - Solaxun