Description
With the ongoing work to provide toggleable CPU/GPU execution, we need to provide strong guarantees about the equivalence of results when executing on each device. Toward that end, we should begin using Hypothesis to generate difficult inputs and compare the output of CPU and GPU execution. A utility that makes it easy to do so for both classifiers and regressors would be extremely beneficial.
The Hypothesis testing we do on the FIL backend offers some lessons here. Most importantly, for generating large datasets, we probably do not want Hypothesis to generate a single array all at once but instead generate several smaller arrays and concatenate them together. That tends to let us explore more diverse inputs more quickly, at the cost of a more difficult shrinking process for Hypothesis when it minimizes a failing example. For small datasets, generating arrays individually is probably more beneficial.
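A minimal sketch of the chunked approach might look like the following. The helper name, chunk size, and value bounds are illustrative assumptions, not existing cuML or FIL API:

```python
# Sketch: build a large 2-D array by concatenating several independently
# generated smaller chunks, rather than drawing one huge array at once.
# All names and parameters here are hypothetical, for illustration only.
import numpy as np
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays


def chunked_matrix(n_cols, max_chunks=8, chunk_rows=32):
    """Strategy for a float32 matrix assembled from 1..max_chunks chunks,
    each of shape (chunk_rows, n_cols)."""
    chunk = arrays(
        dtype=np.float32,
        shape=(chunk_rows, n_cols),
        elements=st.floats(-1e6, 1e6, width=32),
    )
    # Drawing a list of small chunks and concatenating lets Hypothesis
    # vary each chunk independently, diversifying the overall input.
    return st.lists(chunk, min_size=1, max_size=max_chunks).map(
        lambda chunks: np.concatenate(chunks, axis=0)
    )
```

A test could then draw from `chunked_matrix(n_features)`, run inference on both devices, and assert that the outputs match to within tolerance.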
It's an open question how much we want to use Hypothesis for training datasets as opposed to inference inputs. While testing on oddly-constructed training data may expose some corner cases, it may also lead to relatively uninteresting or trivial models. Having Hypothesis use ordinary make_blobs-style dataset constructors for some large fraction of the training data may be a good idea.
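One way to do that is to drive the parameters of a conventional dataset generator with Hypothesis, rather than generating raw training arrays. A sketch, assuming scikit-learn's make_blobs and with all parameter ranges chosen arbitrarily for illustration:

```python
# Sketch: let Hypothesis choose make_blobs parameters (size, dimensionality,
# cluster count, seed) instead of generating raw training arrays directly.
# The helper name and parameter bounds are assumptions, not an existing API.
from hypothesis import strategies as st
from sklearn.datasets import make_blobs


def blobs_dataset():
    """Strategy yielding (X, y) training sets from make_blobs with
    Hypothesis-chosen parameters, keeping the data realistically clustered."""
    return st.builds(
        make_blobs,
        n_samples=st.integers(min_value=50, max_value=500),
        n_features=st.integers(min_value=2, max_value=32),
        centers=st.integers(min_value=2, max_value=8),
        random_state=st.integers(min_value=0, max_value=2**31 - 1),
    )
```

This keeps models non-trivial while still letting Hypothesis vary the shape of the training problem; fully adversarial training arrays could then be reserved for a smaller fraction of test cases.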