Issue
I have trained a roberta-large model with load_best_model_at_end=True and metric_for_best_model=f1. During training I can see overfitting after the 6th epoch, which is the sweet spot. In epoch 8, which is the next one to be evaluated due to gradient accumulation, the train loss decreases while the eval loss increases, so overfitting sets in. At the end, the transformers Trainer loads the model from epoch 8, checkpoint-14928, because its F1 score is slightly higher. I was wondering: in theory, wouldn't the model from epoch 6 be better suited, as it did not overfit? Or does one really go by the F1 metric here even though the model did overfit? (The eval loss decreased constantly in epochs < 6.)
The test loss of the second checkpoint, which is the one loaded as the "best" model, is 0.128. Is it possible to lower that by using the first checkpoint, which should be the better model anyway?
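For reference, a minimal sketch of the setup described above (the output directory, strategies, and other values are placeholders, not the actual configuration from this run):

from transformers import TrainingArguments

# Rough sketch of the configuration described in the question; output_dir and
# the strategy values are placeholders, not the asker's actual settings.
args = TrainingArguments(
    output_dir="roberta-large-finetuned",  # hypothetical path
    evaluation_strategy="epoch",           # evaluate at the end of each epoch
    save_strategy="epoch",                 # must match evaluation_strategy for best-model loading
    load_best_model_at_end=True,           # reload the "best" checkpoint when training ends
    metric_for_best_model="f1",            # "best" is judged by eval_f1, not eval_loss
    greater_is_better=True,                # higher F1 counts as better
)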
checkpoint-11196:
{'loss': 0.0638, 'learning_rate': 8.666799323450404e-06, 'epoch': 6.0}
{'eval_loss': 0.09599845856428146, 'eval_accuracy': 0.9749235986101227, 'eval_precision': 0.9648319293367138, 'eval_recall': 0.9858766505097777, 'eval_f1': 0.9752407721241682, 'eval_runtime': 282.2294, 'eval_samples_per_second': 84.637, 'eval_steps_per_second': 2.647, 'epoch': 6.0}
VS.
checkpoint-14928:
{'loss': 0.0312, 'learning_rate': 7.4291115311909265e-06, 'epoch': 8.0}
{'eval_loss': 0.12377820163965225, 'eval_accuracy': 0.976305103194206, 'eval_precision': 0.9719324391455539, 'eval_recall': 0.9810295838208257, 'eval_f1': 0.9764598236566295, 'eval_runtime': 276.7619, 'eval_samples_per_second': 86.309, 'eval_steps_per_second': 2.699, 'epoch': 8.0}
Solution
You could simply comment out the metric_for_best_model='f1' part and see for yourself; loss is the default setting. Alternatively, use from_pretrained('path/to/checkpoint') to compare the two checkpoints back to back. The F-score is threshold sensitive, so it is entirely possible for the lower-loss checkpoint to turn out better in the end (assuming you do optimize the threshold). A sketch of that comparison is given below.
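As a hedged sketch of that back-to-back comparison (the checkpoint paths, batch size, and metric function are assumptions, and eval_dataset has to be your own tokenized split), something like this would score both checkpoints with the same evaluation loop and, for a binary task, also show how much the decision threshold moves the F-score:

import numpy as np
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Hypothetical paths -- substitute the checkpoint directories from your own run.
CHECKPOINTS = ["output/checkpoint-11196", "output/checkpoint-14928"]

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}

def evaluate_checkpoint(path, eval_dataset):
    """Reload one checkpoint and run the same evaluation loop on the held-out split."""
    model = AutoModelForSequenceClassification.from_pretrained(path)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tmp_eval", per_device_eval_batch_size=32),
        eval_dataset=eval_dataset,          # your tokenized validation or test split
        compute_metrics=compute_metrics,
    )
    return trainer.evaluate()

def best_threshold_f1(positive_probs, labels):
    """Sweep decision thresholds (binary case) to show how the threshold moves F1."""
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(labels, (positive_probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Usage (eval_dataset must be the same tokenized split used during training):
# for path in CHECKPOINTS:
#     print(path, evaluate_checkpoint(path, eval_dataset))

Running both checkpoints through the identical evaluation loop (and, if applicable, the same tuned threshold) gives a fair comparison, rather than relying on the single eval_f1 value the Trainer used to pick its "best" model.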
Answered By - dx2-66