Repeatable sampling of data sets in BigQuery for machine learning

An efficient, fast, and repeatable selection method that works on very large data sets.

Doing machine learning on distributed data sets is methodologically similar to working with data that fits in memory: train your algorithm on one subset of the data, validate on another subset, and finally test on a third. In this post,