Issue
As far as I know, when training and validating a model on the GPU, GPU memory is mainly used for loading data and for the forward and backward passes. As I understand it, GPU memory usage should therefore be the same 1) before training, 2) after training, 3) before validation, and 4) after validation. But in my case, the GPU memory used in the validation phase is still occupied during the training phase, and vice versa. It is not increasing per epoch, so I'm sure it is not a common mistake like forgetting loss.item().
Here is a summary of my questions:
- Shouldn't the GPU memory used in one phase be cleaned up before the other phase starts (except for the model weights)?
- If it should, are there any noob mistakes I'm making here?
Thank you for your help.
Here is the code for the training loop:
eval_result = evaluate(model, val_loader, True, True)
print(eval_result)
print('start training')
for epoch in range(num_epoch):
    model.train()
    time_ = datetime.datetime.now()
    for iter_, data in enumerate(tr_loader):
        x, y = data
        x = x.to(device).view(x.shape[0], 1, *(x.shape[1:]))
        y = y.to(device).long()
        pred = model.forward(x)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # print
        print_iter = 16
        if (iter_ + 1) % print_iter == 0:
            elapsed = datetime.datetime.now() - time_
            expected = elapsed * (num_batches / print_iter)
            _epoch = epoch + ((iter_ + 1) / num_batches)
            print('\rTRAIN [{:.3f}/{}] loss({}) '
                  'elapsed {} expected per epoch {}'.format(
                      _epoch, num_epoch, loss.item(), elapsed, expected),
                  end="\t\t\t")
            time_ = datetime.datetime.now()
    print()
    eval_result = evaluate(model, val_loader, True, True)
    print(eval_result)
    scheduler.step(eval_result[0])
    if (epoch + 1) % 1 == 0:
        save_model(model, optimizer, scheduler)
I've read that making the validation phase its own function helps, since Python uses function scoping, so evaluate() is:
def evaluate(model, val_loader, get_acc=True, get_IOU=True):
    """
    pred: Tensor of shape B C D H W
    label: Tensor of shape B D H W
    """
    val_loss = 0
    val_acc = 0
    val_IOU = 0
    with torch.no_grad():
        model.eval()
        for data in tqdm(val_loader):
            x, y = data
            x = x.to(device).view(x.shape[0], 1, *(x.shape[1:]))
            y = y.to(device).long()
            pred = model.forward(x)
            loss = loss_fn(pred, y)
            val_loss += loss.item()
            pred = torch.argmax(pred, dim=1)
            if get_acc:
                total = np.prod(y.shape)
                total = total if total != 0 else 1
                val_acc += torch.sum((pred == y)).cpu().item() / total
            if get_IOU:
                iou = 0
                for class_num in range(1, 8):
                    iou += torch.sum((pred == class_num) & (y == class_num)).cpu().item() \
                         / torch.sum((pred == class_num) | (y == class_num)).cpu().item()
                val_IOU += iou / 7
    val_loss /= len(val_loader)
    val_acc /= len(val_loader)
    val_IOU /= len(val_loader)
    return (val_loss, val_acc, val_IOU)
And here is the GPU usage in Colab: 1 is the point where evaluate() is first called, and 2 is where training starts.
Solution
Allocating GPU memory is slow, so PyTorch retains the GPU memory it allocates even after no tensors referencing that memory remain. You can call torch.cuda.empty_cache() to release this cached memory that is no longer in use by any tensor.
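To make the distinction concrete, here is a minimal, self-contained sketch (not part of the original question or answer; the tensor size is arbitrary) showing the difference between memory held by live tensors and memory PyTorch keeps cached, and how torch.cuda.empty_cache() only releases the cached portion:

import torch

device = torch.device('cuda')

# Allocate a tensor (~256 MB of float32), then drop the reference to it.
x = torch.randn(1024, 1024, 64, device=device)
print(torch.cuda.memory_allocated(device))  # bytes held by live tensors
print(torch.cuda.memory_reserved(device))   # bytes PyTorch has cached from the driver

del x
print(torch.cuda.memory_allocated(device))  # drops back towards zero
print(torch.cuda.memory_reserved(device))   # stays high: the cache is kept for reuse

torch.cuda.empty_cache()
print(torch.cuda.memory_reserved(device))   # cached blocks are returned to the driver

Tools such as nvidia-smi (and the Colab resource graph) report the reserved figure, not the allocated one, which is why memory used in one phase still appears occupied during the other even though no tensors from that phase remain.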
Answered By - jodag