Fine-tuning a pre-trained language model (LM) has become the de facto standard for transfer learning in natural language processing. Over the last three years (Ruder, 2018), fine-tuning (Howard & Ruder, 2018) has superseded feature extraction from pre-trained embeddings (Peters et al., 2018), while pre-trained language models are favoured over models trained on translation (McCann et