Early scalability demonstrations #345

Closed
@alimanfoo

It would be very useful to have some simple demonstrations that, using sgkit and the underlying stack, we can run computations over large input datasets. These would show that we are making real progress towards the goal of interactive analysis of biobank-scale datasets, given a sufficiently powerful HPC/cloud/GPU cluster.

The main purpose of this is to be able to give some highlights in a short report on what the sgkit development work has achieved to date.

By "demonstration" I mean just a simple benchmark result.

E.g., being able to say that we can run a PCA on UK Biobank-sized data using an N-node cluster in Google Cloud and compute the result in M minutes.

E.g., being able to say we can run a pairwise distance computation on the MalariaGEN Ag1000G phase 2 data and compute the result in M minutes on an N-node CPU cluster in Google Cloud. Even better would be being able to say we can also run the same computation in M' minutes on an N'-node GPU cluster.

I.e., I just need a couple of data points here showing proof of concept; I don't need any detailed performance or scalability benchmarking.
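As a rough illustration of the kind of data point meant here, a single wall-clock timing could be collected along the lines below. This is only a sketch on a small synthetic array using plain NumPy, not sgkit or Dask on a cluster. The array shape, the random genotype calls, and the `pairwise_euclidean` and `time_call` helpers are all assumptions made up for illustration, and Euclidean distance is just one possible metric.

```python
import time

import numpy as np


def pairwise_euclidean(x):
    """Dense pairwise Euclidean distances between the rows of x."""
    sq = np.einsum("ij,ij->i", x, x)  # squared norm of each row
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by rounding
    return np.sqrt(d2)


def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


rng = np.random.default_rng(0)
# Hypothetical stand-in for real data: 200 samples x 1,000 variants,
# each entry an alternate allele count of 0, 1 or 2.
genotypes = rng.integers(0, 3, size=(200, 1000)).astype(np.float64)

dist, seconds = time_call(pairwise_euclidean, genotypes)
print(f"{dist.shape[0]} samples x {genotypes.shape[1]} variants: {seconds:.3f} s")
```

For a real demonstration the same "time one call, report dataset size and cluster size" pattern would apply, with the synthetic array replaced by the actual dataset and the NumPy function replaced by the distributed computation.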

Raising this issue to share suggestions for computations/datasets to use, and post back any results.
