Published: 20 Dec 2019, Last Modified: 22 Oct 2023
ICLR 2020 Conference Blind Submission

Abstract: Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective.
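To make the corruption step concrete, below is a minimal, illustrative sketch of MLM-style input corruption. The 15% masking rate follows BERT's published setting; the function name `mlm_corrupt` and all other identifiers are hypothetical and not taken from any actual codebase.

```python
import random

def mlm_corrupt(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Replace a random subset of tokens with [MASK] (illustrative sketch).

    mask_prob=0.15 follows BERT's setting; names here are hypothetical.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets.append(tok)   # the model is trained to reconstruct this token
        else:
            corrupted.append(tok)
            targets.append(None)  # no reconstruction loss at unmasked positions
    return corrupted, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = mlm_corrupt(tokens, seed=0)
print(corrupted)  # e.g. ['the', 'quick', '[MASK]', 'fox', ...]
```

Because the reconstruction loss is computed only at the masked positions, the model learns from just a small fraction of each input sequence, which is one reason such pre-training demands large amounts of compute.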