Probing models

It was very refreshing to see that, rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture. This was most commonly done by automatically creating a dataset that focuses on one particular aspect of the generalization behaviour and then evaluating different trained models on this dataset: Conneau et al., for instance, evalu