Description
It would be very useful to have some simple demonstrations that, using sgkit and the underlying stack, we can run computations over large input datasets. This would show that we are making real steps towards the goal of interactive analysis of biobank-scale datasets, given a sufficiently powerful HPC/cloud/GPU cluster.
The main purpose is to be able to give some highlights, in a short report, of what the sgkit development work has achieved to date.
By "demonstration" I mean just a simple benchmark result.
E.g., being able to say that we can run a PCA on UK Biobank-sized data using an N-node cluster in Google Cloud and compute the result in M minutes.
E.g., being able to say that we can run a pairwise distance computation on the MalariaGEN Ag1000G phase 2 data and compute the result in M minutes on an N-node CPU cluster in Google Cloud. Even better would be being able to say that we can also run the same computation in M' minutes on an N'-node GPU cluster.
I.e., I just need a couple of data points here showing proof of concept. I don't need any detailed performance or scalability benchmarking.
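To make the shape of such a data point concrete, here is a minimal sketch of timing a single computation and reporting it as "M minutes on N nodes". It is only an illustration: it uses the Python standard library on a small random toy matrix rather than sgkit, Dask, or real genotype data, and the names `pairwise_euclidean` and `run_benchmark` are hypothetical helpers, not part of any library. A real run would swap in the sgkit computation and the actual cluster size.

```python
import math
import random
import time


def pairwise_euclidean(rows):
    """Return the full symmetric matrix of Euclidean distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(rows[i], rows[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist


def run_benchmark(label, func, *args, n_nodes=1):
    """Time one call of func and return (result, one-line summary).

    n_nodes is only recorded in the summary; this sketch runs locally.
    """
    start = time.perf_counter()
    result = func(*args)
    minutes = (time.perf_counter() - start) / 60
    summary = f"{label}: {minutes:.4f} min on {n_nodes} node(s)"
    return result, summary


# Toy stand-in for a genotype matrix: 50 samples x 200 variants, calls in {0, 1, 2}.
rng = random.Random(42)
genotypes = [[rng.randint(0, 2) for _ in range(200)] for _ in range(50)]

distances, summary = run_benchmark(
    "pairwise distance (toy data)", pairwise_euclidean, genotypes
)
print(summary)
```

A single wall-clock measurement like this is all that is being asked for here: one timing per computation, dataset, and cluster configuration.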
I'm raising this issue to share suggestions for computations and datasets to use, and as a place to post back any results.