## Overview

The BERT model was proposed in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a bidirectional transformer pretrained on a large corpus comprising the Toronto Book Corpus and Wikipedia, using a combination of a masked language modeling (MLM) objective and next sentence prediction (NSP).
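To make the masked language modeling objective concrete, here is a minimal sketch that queries BERT's MLM head through the `fill-mask` pipeline in 🤗 Transformers; the `bert-base-uncased` checkpoint and the example sentence are illustrative choices, not prescribed by the paper:

```python
from transformers import pipeline

# Load a BERT checkpoint together with its masked language modeling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK] using context from both directions.
for prediction in unmasker("The capital of France is [MASK]."):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")
```

Because BERT attends to tokens on both sides of the mask, the top predictions are conditioned on the full sentence rather than only the left context.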