This issue tracks the changes needed to improve the reproducibility and meaningfulness of our benchmarks, now that they run on a dedicated cluster node (thanks @JuanPedroGHM!) and runtime variance has decreased dramatically.

Open questions:
- Do we benchmark only the high-level routines (e.g., the standard scaler) and leave out the low-level routines (here: mean, std, in-place subtraction and division), or do we benchmark both? (See the first sketch below for how both levels could share one harness.)
- Do we pick one representative case per routine, or do we try to cover as many cases as possible, e.g. all split combinations rather than a single representative one (and how would we determine that)? Do we also include non-split / trivially parallel routines? (See the second sketch below.)
- Do we aim, at least initially, for roughly equal run times per benchmark, for roughly equal data sizes per benchmark, or for a completely different approach?
- Regarding the names: do we keep the old benchmark names so that the historical data remains comparable, or do we start from scratch once the new benchmark sizes and additional benchmarks are set up?
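If we end up benchmarking both levels, a minimal sketch of a shared harness could look like the following. It assumes Heat's `ht.mean`, `ht.std`, broadcasting arithmetic on DNDarrays, and `ht.preprocessing.StandardScaler`; the sizes, labels, and the `timed` helper are purely illustrative, not a proposal for the final benchmark code.

```python
# A minimal sketch, not the actual benchmark suite: times low-level building
# blocks and the high-level routine on the same input, synchronizing ranks
# with an explicit MPI barrier via mpi4py.
import time

import heat as ht
from mpi4py import MPI


def timed(label, fn, *args, **kwargs):
    """Run fn once and report wall-clock time on rank 0 (illustrative helper)."""
    MPI.COMM_WORLD.Barrier()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    MPI.COMM_WORLD.Barrier()
    if MPI.COMM_WORLD.rank == 0:
        print(f"{label}: {time.perf_counter() - start:.4f} s")
    return result


def bench_low_level(x):
    # low-level building blocks of the scaler
    m = timed("mean", ht.mean, x, axis=0)
    s = timed("std", ht.std, x, axis=0)
    timed("subtract", lambda: x - m)
    timed("divide", lambda: x / s)


def bench_high_level(x):
    # high-level routine built on the same operations
    # (assumes ht.preprocessing.StandardScaler is available)
    scaler = ht.preprocessing.StandardScaler()
    timed("standard_scaler", scaler.fit_transform, x)


if __name__ == "__main__":
    data = ht.random.randn(1_000_000, 128, split=0)  # placeholder size
    bench_low_level(data)
    bench_high_level(data)
```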
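For the split-coverage question, one option is a single parameterized case instead of one hand-written benchmark per split. The snippet below is only a sketch under that assumption; `run_case`, the routine, and the sizes are placeholders.

```python
# Hedged sketch: iterate over split configurations with one parameterized case.
import heat as ht


def run_case(split):
    """Create the input with the given split and run the routine under test."""
    x = ht.random.randn(100_000, 64, split=split)  # placeholder size/routine
    return ht.mean(x, axis=0)


# None covers the non-split (replicated / trivially parallel) case,
# 0 and 1 cover the row- and column-split cases of a 2D input.
for split in (None, 0, 1):
    run_case(split)
```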
As usual, feel free to add/edit.