Issue
I have created a tiny dataset where an exact linear relationship holds. The code is as follows:
import numpy as np
def gen_data(n, k):
    np.random.seed(5711)
    beta = np.random.uniform(0, 1, size=(k, 1))
    print("beta is:", beta)
    X = np.random.normal(size=(n, k))
    y = X.dot(beta).reshape(-1, 1)
    D = np.concatenate([X, y], axis=1)
    return D.astype(np.float32)
Now I have fitted a PyTorch neural network with an SGD optimizer and MSE loss, and it converged approximately to the true values within 50 epochs with a learning rate of 1e-1.
I tried to set up exactly the same model in TensorFlow:
import keras.layers
from sklearn.model_selection import train_test_split
from keras.models import Sequential
import tensorflow as tf
n = 10
k = 2
X = gen_data(n, k)
D_train, D_test = train_test_split(X, test_size=0.2)
X_train, y_train = D_train[:,:k], D_train[:,k:]
X_test, y_test = D_test[:,:k], D_test[:,k:]
model = Sequential([keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.SGD(lr=1e-1), loss=tf.keras.losses.mean_squared_error)
model.fit(X_train, y_train, batch_size=64, epochs=50)
When I call model.get_weights(), it shows substantial differences from the true values, and the loss is still not even close to zero. I don't know why this model does not perform as well as the PyTorch model. Even disregarding the PyTorch model, shouldn't the network converge to the true values on this tiny toy dataset? What is my error in setting up the model?
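For reference, this is the kind of check I mean (a minimal sketch; it assumes the model above has been fitted and compares against the beta printed by gen_data):
# Dense(1) stores a kernel of shape (k, 1) and a bias of shape (1,)
w, b = model.get_weights()
print("fitted kernel:", w.ravel())  # should approach beta
print("fitted bias:", b)            # should approach 0, since y = X.dot(beta) has no intercept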
EDIT: Here is my full PyTorch code for comparison:
import torch
from torch.utils.data import DataLoader, Dataset, Sampler, SequentialSampler, RandomSampler
from torch import nn
from sklearn.model_selection import train_test_split
n = 10
k = 2
device = "cpu"
class Daten(Dataset):
    def __init__(self, df):
        self.df = df
        self.ycol = df.shape[1] - 1

    def __getitem__(self, index):
        return self.df[index, :self.ycol], self.df[index, self.ycol:]

    def __len__(self):
        return self.df.shape[0]

def split_into(D, batch_size=64, **kwargs):
    D_train, D_test = train_test_split(D, **kwargs)
    df_train, df_test = Daten(D_train), Daten(D_test)
    dl_train, dl_test = DataLoader(df_train, batch_size=batch_size), DataLoader(df_test, batch_size=batch_size)
    return dl_train, dl_test
D = gen_data(n, k)
dl_train, dl_test = split_into(D, test_size=0.2)
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Sequential(
            nn.Linear(k, 1)
        )

    def forward(self, x):
        ypred = self.linear(x)
        return ypred
model = NeuralNetwork().to(device)
print(model)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        print(y.shape)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

epochs = 50
for t in range(epochs):
    print(f"Epoch {t + 1}\n-------------------------------")
    train(dl_train, model, loss_fn, optimizer)
print("Done!")
EDIT:
I increased the number of epochs dramatically. After epochs=1000, the weights come close to the true values. So my best guess for the discrepancy is that TF applies some non-optimal initialization?
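If the initialization is the suspect, a quick way to check it (a sketch, assuming TF 2.x defaults) is to build the layer and print its weights before any training:
model = Sequential([keras.layers.Dense(1)])
model.build(input_shape=(None, k))  # create the variables without training
print(model.get_weights())          # default Dense init: glorot_uniform kernel, zero bias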
Solution
Your lr parameter for SGD is deprecated:
WARNING:absl: lr is deprecated in Keras optimizer, please use learning_rate or use the legacy optimizer, e.g., tf.keras.optimizers.legacy.SGD.
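The legacy-optimizer route mentioned in the warning would look roughly like this (just a sketch; renaming the argument, as below, is the simpler fix):
model.compile(optimizer=tf.keras.optimizers.legacy.SGD(learning_rate=1e-1), loss=tf.keras.losses.mean_squared_error)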
If I use
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-1), loss=tf.keras.losses.mean_squared_error)
then I get loss: 7.0588e-05 (without bias: loss: 2.0572e-08).
With my simple torch model, I got loss: 5.3355e-05 (without bias: loss: 5.3071e-09).
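The "without bias" numbers come from dropping the intercept term; a sketch of how that can be done (assuming use_bias=False / bias=False is how it was disabled):
# Keras: no intercept in the Dense layer
model = Sequential([keras.layers.Dense(1, use_bias=False)])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-1), loss=tf.keras.losses.mean_squared_error)
# PyTorch equivalent: nn.Linear(k, 1, bias=False)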
It's interesting that the bias plays a negative role here. I think the relation between X and y is too linear for the bias to be of any use, but the model tries to fit it anyway. If you add the line
y += np.random.rand(*y.shape)*0.2
to the data creation, then the model with bias will perform better for both torch and TF, as there is then an actual bias in the relation between X and y that the model can learn.
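For concreteness, a sketch of gen_data with that extra line (the uniform noise lies in [0, 0.2) and has mean 0.1, so it effectively adds a small intercept):
def gen_data(n, k):
    np.random.seed(5711)
    beta = np.random.uniform(0, 1, size=(k, 1))
    print("beta is:", beta)
    X = np.random.normal(size=(n, k))
    y = X.dot(beta).reshape(-1, 1)
    y += np.random.rand(*y.shape) * 0.2  # small positive noise: introduces an intercept to learn
    D = np.concatenate([X, y], axis=1)
    return D.astype(np.float32)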
Answered By - mhenning