Terminal Phase of Training (TPT) — training beyond zero training error, i.e. the training error is already 0 while we keep pushing the training loss further down. The aim is to reduce the loss as much as possible even though the misclassification rate is already 0. Why would someone do that? One would expect such a model to be highly overfitted to the training data and noisy, but it has recently been shown empirically that the reverse is true.
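
A minimal sketch of what TPT looks like in practice, assuming PyTorch; `model`, `train_loader`, and `optimizer` are hypothetical stand-ins for whatever network, data pipeline, and optimizer you are using. The only point it illustrates is that the loop does not stop when the training error hits 0 — it keeps minimizing the (still nonzero) cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def train_into_tpt(model, train_loader, optimizer, epochs=350, device="cpu"):
    """Keep optimizing cross-entropy even after training error reaches 0."""
    model.to(device)
    model.train()
    for epoch in range(epochs):
        total_loss, correct, seen = 0.0, 0, 0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = F.cross_entropy(logits, y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * y.size(0)
            correct += (logits.argmax(dim=1) == y).sum().item()
            seen += y.size(0)
        train_error = 1.0 - correct / seen
        # TPT is the regime where train_error == 0 but we deliberately
        # keep training, driving the loss further toward 0.
        print(f"epoch {epoch}: train error = {train_error:.4f}, "
              f"train loss = {total_loss / seen:.6f}")
```

Note the deliberate absence of an early-stopping check on training error: a conventional loop would break once `train_error == 0`, whereas TPT continues for many more epochs past that point.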