Use HAPNEST data for gwas demo? #43

Open
jeromekelleher opened this issue Nov 23, 2023 · 8 comments

Comments

@jeromekelleher
Collaborator

The 1 million sample HAPNEST dataset (https://github.com/pystatgen/sgkit/discussions/1144#discussioncomment-7654640) seems ideal for our purposes.

Larger than UKB, and no messing with data access problems. Also lets us showcase our plink format support.

Any thoughts @hammer ?

@jeromekelleher
Collaborator Author

Also includes phenotypes, btw

@jeromekelleher
Collaborator Author

The advantages of a fully reproducible analysis pipeline to go along with the paper seem compelling to me. Working with something like UKB inevitably introduces friction. This synthetic dataset has been carefully curated for realism, and I'm not sure what extra we'd be showing by working with actual data.

There's a neatness to demonstrating that we can work with two different synthetic datasets at the 1 million sample scale, through both VCF and plink.

If we make it a requirement that all of the things that go into the paper are fully reproducible (which chimes well with the overall philosophy of openness), and we want to do something at the largest scale, then this seems like a great way to go.

@hammer

hammer commented Nov 27, 2023

I will have a look this week! I've been using GitHub Codespaces so far for my explorations and will need to think about how to scale the experiments. We hit some scalability issues last time we tried to do a GWAS at the UKB scale (https://github.com/pystatgen/sgkit/issues/390), so I may also need to get some help resolving those issues.

A quick look at the S-BSST936 listing shows the .bed files range from 141.37 GB (chr2) to 27.64 GB (chr21). I wonder if anyone has put this data on a cloud object store already? I'll poke around a bit to save myself the download time.

> two different synthetic datasets at the 1 million sample scale, through both VCF and plink.

@jeromekelleher forgive my ignorance but do we have a VCF synthetic data set at this scale as well?

@hammer

hammer commented Nov 27, 2023

Some places to look for this data on cloud storage already:

@jeromekelleher
Collaborator Author

jeromekelleher commented Nov 27, 2023

> @jeromekelleher forgive my ignorance but do we have a VCF synthetic data set at this scale as well?

Yep - our data/basic compute task scaling figure goes up to a million samples, taken as subsets of the 1.4M in the simulations provided in this paper

(Note: @benjeffery and I are planning to add another line for the SAV file format/C++ toolkit here. Fig is also quite drafty, obvs)

[Figure: fig1]

@hammer

hammer commented Nov 29, 2023

Okay figured out their FTP structure, everything is under ftp://ftp.ebi.ac.uk//biostudies/fire/S-BSST/936/S-BSST936/Files. Will start moving to a cloud store now.

For my reference, I'm using a command like:

curl ftp://ftp.ebi.ac.uk//biostudies/fire/S-BSST/936/S-BSST936/Files/example/<file> | gsutil cp - gs://<bucket>/<file>
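
For a bulk transfer, the same pipe could be driven from Python so that a failure on either side is noticed (a plain shell pipe only reports the exit status of the last command). This is only a sketch: the helper names and the `example/demo.bed` path in the usage note are hypothetical, and it assumes `curl` and `gsutil` are on the PATH.

```python
import subprocess

# FTP root taken from the comment above.
FTP_ROOT = "ftp://ftp.ebi.ac.uk//biostudies/fire/S-BSST/936/S-BSST936/Files"


def build_transfer_commands(relative_path, bucket):
    """Return the (curl, gsutil) argument lists for one file.

    relative_path is the path below FTP_ROOT; the object is stored in the
    bucket under the file's base name, as in the one-liner above.
    """
    source_url = f"{FTP_ROOT}/{relative_path}"
    object_name = relative_path.rsplit("/", 1)[-1]
    curl_cmd = ["curl", "--fail", "--silent", "--show-error", source_url]
    gsutil_cmd = ["gsutil", "cp", "-", f"gs://{bucket}/{object_name}"]
    return curl_cmd, gsutil_cmd


def stream_to_bucket(relative_path, bucket):
    """Pipe curl's output into gsutil and raise if either process fails."""
    curl_cmd, gsutil_cmd = build_transfer_commands(relative_path, bucket)
    curl = subprocess.Popen(curl_cmd, stdout=subprocess.PIPE)
    gsutil = subprocess.Popen(gsutil_cmd, stdin=curl.stdout)
    # Close our copy of the pipe so curl gets SIGPIPE if gsutil exits early.
    curl.stdout.close()
    gsutil_rc = gsutil.wait()
    curl_rc = curl.wait()
    if curl_rc != 0 or gsutil_rc != 0:
        raise RuntimeError(
            f"transfer failed: curl exited {curl_rc}, gsutil exited {gsutil_rc}"
        )
```

Usage would be something like `stream_to_bucket("example/demo.bed", "my-bucket")`, looped over the file listing.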

Transfer speeds not so bad, seeing around 27 MiB/s, will take about 17 minutes for chr21 and probably 2 hours or so for chr1. Will kick off a big transfer tomorrow.
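
The estimate above can be roughly reproduced with a few lines of arithmetic. This sketch assumes the listed sizes are decimal gigabytes (if they are actually GiB, the times come out about 7% longer) and that the 27 MiB/s rate holds for the whole transfer. The chr1 size is not quoted in this thread, so the largest listed file (chr2) stands in for it.

```python
def estimate_transfer_minutes(size_gb, rate_mib_per_s):
    """Estimate transfer time in minutes for a file of size_gb gigabytes."""
    size_bytes = size_gb * 1e9                 # assumption: decimal GB
    rate_bytes_per_s = rate_mib_per_s * 1024 ** 2
    return size_bytes / rate_bytes_per_s / 60


# Sizes from the S-BSST936 listing quoted earlier in the thread.
bed_sizes_gb = {"chr21": 27.64, "chr2": 141.37}

for name, size_gb in bed_sizes_gb.items():
    minutes = estimate_transfer_minutes(size_gb, rate_mib_per_s=27)
    print(f"{name}: about {minutes:.0f} minutes")
```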

@hammer

hammer commented Dec 3, 2023

Okay I've gotten our GWAS demo running using one chromosome and one phenotype of the example (600 subjects) data.

Notebook is at https://github.com/hammer/sgkitpub/blob/main/hapnest_gwas.ipynb

Some thoughts:

  • GWAS demo uses a quantitative trait, while HAPNEST has binary traits.
  • GWAS demo uses a VCF file with sequencing features, so QC is a lot more interesting.
  • At least for the first phenotype on the first chromosome, there's no association to find.
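
To make the first point concrete: with a binary trait, the per-variant test is a case/control comparison rather than a regression on a continuous value. The sketch below is a generic textbook allelic chi-square test in plain Python, included only as an illustration. It is not what the notebook or sgkit does, and the function name and toy data are made up.

```python
import math


def allelic_chi_square(genotypes, is_case):
    """Allelic 2x2 chi-square test for one variant.

    genotypes: alternate-allele count (0, 1 or 2) per sample.
    is_case:   True for cases, False for controls, in the same order.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    case_alt = case_ref = ctrl_alt = ctrl_ref = 0
    for g, case in zip(genotypes, is_case):
        if case:
            case_alt += g
            case_ref += 2 - g
        else:
            ctrl_alt += g
            ctrl_ref += 2 - g
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    denom = (
        (case_alt + case_ref)
        * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt)
        * (case_ref + ctrl_ref)
    )
    if denom == 0:
        # Monomorphic variant or an empty group: nothing to test.
        return 0.0, 1.0
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom
    # Survival function of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data: the alternate allele is enriched in the four cases.
chi2, p = allelic_chi_square(
    genotypes=[2, 2, 1, 1, 0, 0, 0, 0],
    is_case=[True, True, True, True, False, False, False, False],
)
print(f"chi2={chi2:.2f}, p={p:.4f}")
```

A real analysis would more likely use logistic regression with covariates; this only shows the shape of the change from the quantitative-trait demo.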

I will next try to scale to all chromosomes and all phenotypes on the example data, then go to the big dataset.

@hammer

hammer commented Dec 24, 2023

Just noting for myself that tools from other language ecosystems that might be fun to try out in this section would be https://github.com/privefl/bigsnpr (GWAS docs) and https://github.com/OpenMendel/MendelGWAS.jl
