Issue
I'm implementing a SimCLR/ResNet18 architecture on a custom dataset.
I know that
Number of Iterations in One Epoch = Total Training Dataset Size / Batch Size
and that if the result is not an integer, the size of the last batch is adapted to hold the leftovers (the 'fractional batch'). However, in my case this mechanism does not seem to work. My dataset has 7000 samples. If I use a batch size of 70, I get 7000/70 = 100 iterations, there is no fractional batch, and training runs fine. However, if I use a batch size of 32 for instance, I get the following error (full stack trace):
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/bin/python /home/wlutz/PycharmProjects/hiv-image-analysis/main.py
2023-10-20 11:12:22.106008: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-20 11:12:22.107921: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-20 11:12:22.133919: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-20 11:12:22.133941: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-20 11:12:22.133955: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-20 11:12:22.138715: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-20 11:12:22.737271: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/__init__.py:11: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
if not hasattr(numpy, tp_name):
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/__init__.py:11: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.
if not hasattr(numpy, tp_name):
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:34: UnderReviewWarning: The feature generate_power_seq is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
"lr_options": generate_power_seq(LEARNING_RATE_CIFAR, 11),
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:92: UnderReviewWarning: The feature FeatureMapContrastiveTask is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
contrastive_task: Union[FeatureMapContrastiveTask] = FeatureMapContrastiveTask("01, 02, 11"),
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/losses/self_supervised_learning.py:228: UnderReviewWarning: The feature AmdimNCELoss is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
self.nce_loss = AmdimNCELoss(tclip)
available_gpus: 0
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
Dim MLP input: 512
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting `Trainer(gpus=0)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=0)` instead.
rank_zero_deprecation(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:613: UserWarning: Checkpoint directory /home/wlutz/PycharmProjects/hiv-image-analysis/saved_models exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/home/wlutz/PycharmProjects/hiv-image-analysis/main.py:330: UnderReviewWarning: The feature LinearWarmupCosineAnnealingLR is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
scheduler_warmup = LinearWarmupCosineAnnealingLR(optimizer, warmup_epochs=10, max_epochs=max_epochs,
| Name | Type | Params
------------------------------------------
0 | model | AddProjection | 11.5 M
1 | loss | ContrastiveLoss | 0
------------------------------------------
11.5 M Trainable params
0 Non-trainable params
11.5 M Total params
46.024 Total estimated model params size (MB)
Optimizer Adam, Learning Rate 0.0003, Effective batch size 160
Epoch 0: 100%|█████████▉| 218/219 [04:03<00:01, 1.12s/it, loss=3.74, v_num=58, Contrastive loss_step=3.650]Traceback (most recent call last):
File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 388, in <module>
trainer.fit(model, data_loader)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 119, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 143, in step
loss = closure()
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 105, in _wrap_closure
closure_result = closure()
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
step_output = self._step_fn()
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
return self.model.training_step(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 316, in training_step
loss = self.loss(z1, z2)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 243, in forward
denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)
RuntimeError: The size of tensor a (64) must match the size of tensor b (48) at non-singleton dimension 1
Process finished with exit code 1
Here is some code (the error happens at the last line):
train_config = Hparams()
reproducibility(train_config)
model = SimCLR_pl(train_config, model=resnet18(pretrained=False), feat_dim=512)
transform = Augment(train_config.img_size)
data_loader = get_stl_dataloader(train_config.batch_size, transform)
accumulator = GradientAccumulationScheduler(scheduling={0: train_config.gradient_accumulation_steps})
checkpoint_callback = ModelCheckpoint(filename=filename, dirpath=save_model_path, every_n_epochs=2,
                                      save_last=True, save_top_k=2, monitor='Contrastive loss_epoch', mode='min')
trainer = Trainer(callbacks=[accumulator, checkpoint_callback],
                  gpus=available_gpus,
                  max_epochs=train_config.epochs)
trainer.fit(model, data_loader)
and here are my classes:
class Hparams:
    def __init__(self):
        self.epochs = 10  # number of training epochs
        self.seed = 33333  # randomness seed
        self.cuda = True  # use nvidia gpu
        self.img_size = 224  # image shape
        self.save = "./saved_models/"  # save checkpoint
        self.load = False  # load pretrained checkpoint
        self.gradient_accumulation_steps = 5  # gradient accumulation steps
        self.batch_size = 70
        self.lr = 3e-4  # for Adam only
        self.weight_decay = 1e-6
        self.embedding_size = 128  # paper's value is 128
        self.temperature = 0.5  # 0.1 or 0.5
        self.checkpoint_path = '/media/wlutz/TOSHIBA EXT/Image Analysis/VIH PROJECT/models'  # replace checkpoint path here
class SimCLR_pl(pl.LightningModule):
    def __init__(self, config, model=None, feat_dim=512):
        super().__init__()
        self.config = config
        self.model = AddProjection(config, model=model, mlp_dim=feat_dim)
        self.loss = ContrastiveLoss(config.batch_size, temperature=self.config.temperature)

    def forward(self, X):
        return self.model(X)

    def training_step(self, batch, batch_idx):
        (x1, x2) = batch
        z1 = self.model(x1)
        z2 = self.model(x2)
        loss = self.loss(z1, z2)
        self.log('Contrastive loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        max_epochs = int(self.config.epochs)
        param_groups = define_param_groups(self.model, self.config.weight_decay, 'adam')
        lr = self.config.lr
        optimizer = Adam(param_groups, lr=lr, weight_decay=self.config.weight_decay)
        print(f'Optimizer Adam, '
              f'Learning Rate {lr}, '
              f'Effective batch size {self.config.batch_size * self.config.gradient_accumulation_steps}')
        scheduler_warmup = LinearWarmupCosineAnnealingLR(optimizer, warmup_epochs=10, max_epochs=max_epochs,
                                                         warmup_start_lr=0.0)
        return [optimizer], [scheduler_warmup]
class AddProjection(nn.Module):
    def __init__(self, config, model=None, mlp_dim=512):
        super(AddProjection, self).__init__()
        embedding_size = config.embedding_size
        self.backbone = default(model, models.resnet18(pretrained=False, num_classes=config.embedding_size))
        mlp_dim = default(mlp_dim, self.backbone.fc.in_features)
        print('Dim MLP input:', mlp_dim)
        self.backbone.fc = nn.Identity()

        # add mlp projection head
        self.projection = nn.Sequential(
            nn.Linear(in_features=mlp_dim, out_features=mlp_dim),
            nn.BatchNorm1d(mlp_dim),
            nn.ReLU(),
            nn.Linear(in_features=mlp_dim, out_features=embedding_size),
            nn.BatchNorm1d(embedding_size),
        )

    def forward(self, x, return_embedding=False):
        embedding = self.backbone(x)
        if return_embedding:
            return embedding
        return self.projection(embedding)
class ContrastiveLoss(nn.Module):
    """
    Vanilla Contrastive loss, also called InfoNceLoss as in SimCLR paper
    """
    def __init__(self, batch_size, temperature=0.5):
        super().__init__()
        self.batch_size = batch_size
        self.temperature = temperature
        self.mask = (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool)).float()

    def calc_similarity_batch(self, a, b):
        representations = torch.cat([a, b], dim=0)
        similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
        return similarity_matrix

    def forward(self, proj_1, proj_2):
        """
        proj_1 and proj_2 are batched embeddings [batch, embedding_dim]
        where corresponding indices are pairs
        z_i, z_j in the SimCLR paper
        """
        batch_size = proj_1.shape[0]
        z_i = F.normalize(proj_1, p=2, dim=1)
        z_j = F.normalize(proj_2, p=2, dim=1)
        similarity_matrix = self.calc_similarity_batch(z_i, z_j)
        sim_ij = torch.diag(similarity_matrix, batch_size)
        sim_ji = torch.diag(similarity_matrix, -batch_size)
        positives = torch.cat([sim_ij, sim_ji], dim=0)
        nominator = torch.exp(positives / self.temperature)
        # print(" sim matrix ", similarity_matrix.shape)
        # print(" device ", device_as(self.mask, similarity_matrix).shape, " torch exp ", torch.exp(similarity_matrix / self.temperature).shape)
        denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)
        all_losses = -torch.log(nominator / torch.sum(denominator, dim=1))
        loss = torch.sum(all_losses) / (2 * self.batch_size)
        return loss
class ImageDataResourceDataset(VisionDataset):
    train_list = ['train_X_v1.bin', ]
    test_list = ['test_X_v1.bin', ]

    def __init__(self, root: str, transform: Optional[Callable] = None, ):
        super().__init__(root=root, transform=transform)
        self.data = self.__loadfile(self.train_list[0])

    def __len__(self) -> int:
        return self.data.shape[0]

    def __getitem__(self, idx):
        img = self.data[idx]
        img = np.transpose(img, (1, 2, 0))
        img = Image.fromarray(img)
        img = self.transform(img)
        return img

    def __loadfile(self, data_file: str) -> np.ndarray:
        path_to_data = os.path.join(os.getcwd(), 'datasets', data_file)
        everything = np.fromfile(path_to_data, dtype=np.uint8)
        images = np.reshape(everything, (-1, 3, 224, 224))
        images = np.transpose(images, (0, 1, 3, 2))
        return images
For the record, my dataset has 7000 RGB images of size 224x224.
Why is my last 'fractional' batch not supported? Many thanks for your help.
Solution
I found the original source code in this GitHub repository: https://github.com/The-AI-Summer/simclr/blob/main/AI_Summer_SimCLR_Resnet18_STL10.ipynb
Based on the error message you provided:
File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 243, in forward
denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)
The problem seems to originate from this line of code within the ContrastiveLoss(nn.Module) class:
denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)
To investigate the issue with this class, let's first examine its initialization variables:
def __init__(self, batch_size, temperature=0.5):
    super().__init__()
    self.batch_size = batch_size
    self.temperature = temperature
    self.mask = (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool)).float()
The error is related to batch_size, so we need to identify which part of the code calls ContrastiveLoss().
ContrastiveLoss() is instantiated within the SimCLR_pl(pl.LightningModule) class:
self.loss = ContrastiveLoss(config.batch_size, temperature=self.config.temperature)
The problem lies in config.batch_size: ContrastiveLoss() uses it to build self.mask once, at construction time, so the mask size is fixed for the whole run.
You are using 7000 data points with a batch size of 32. Therefore, the batch size for the final iteration is 7000 % 32 = 24.
Because the __init__ method builds the mask as:
self.mask = (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool)).float()
the loss expects tensors of size config.batch_size * 2 = 32 * 2 = 64, whereas the similarity matrix of the last batch has size 24 * 2 = 48. This discrepancy is the cause of the error message:
RuntimeError: The size of tensor a (64) must match the size of tensor b (48) at non-singleton dimension 1
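For illustration, here is a minimal, self-contained sketch (using dummy tensors, not your actual code or data) that reproduces the same shape mismatch:
import torch

full_batch = 32
last_batch = 7000 % 32  # 24 leftover samples in the final iteration
mask = (~torch.eye(full_batch * 2, full_batch * 2, dtype=bool)).float()  # 64 x 64, built once in __init__
similarity_matrix = torch.randn(last_batch * 2, last_batch * 2)  # 48 x 48 for the fractional batch

print(mask.shape, similarity_matrix.shape)  # torch.Size([64, 64]) torch.Size([48, 48])
denominator = mask * torch.exp(similarity_matrix / 0.5)  # raises the same RuntimeError (64 vs 48)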
To resolve this issue, make sure the actual batch size always matches what ContrastiveLoss() expects: either drop the incomplete last batch (for example, by creating the training DataLoader with drop_last=True) or build the mask from the actual batch size inside forward() instead of fixing it in __init__().
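As one possible fix, here is a sketch of the loss with the mask built on the fly; the changed constructor signature (no batch_size argument) and the device= argument replacing device_as are my adjustments, not part of the original code:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Same InfoNCE loss as before, but the mask follows the actual batch size."""
    def __init__(self, temperature=0.5):
        super().__init__()
        self.temperature = temperature

    def forward(self, proj_1, proj_2):
        batch_size = proj_1.shape[0]  # also correct for the fractional last batch
        z_i = F.normalize(proj_1, p=2, dim=1)
        z_j = F.normalize(proj_2, p=2, dim=1)
        representations = torch.cat([z_i, z_j], dim=0)
        similarity_matrix = F.cosine_similarity(
            representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
        sim_ij = torch.diag(similarity_matrix, batch_size)
        sim_ji = torch.diag(similarity_matrix, -batch_size)
        positives = torch.cat([sim_ij, sim_ji], dim=0)
        nominator = torch.exp(positives / self.temperature)
        # mask rebuilt per call, so its size always matches similarity_matrix
        mask = (~torch.eye(2 * batch_size, dtype=torch.bool,
                           device=similarity_matrix.device)).float()
        denominator = mask * torch.exp(similarity_matrix / self.temperature)
        all_losses = -torch.log(nominator / torch.sum(denominator, dim=1))
        return torch.sum(all_losses) / (2 * batch_size)
If you adopt this version, remember to drop the config.batch_size argument from the instantiation in SimCLR_pl. Alternatively, if you prefer to leave the loss untouched, passing drop_last=True to the DataLoader (presumably created inside get_stl_dataloader) discards the 24 leftover samples and avoids the fractional batch entirely.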
Answered By - Chih-Hao Liu