This repository provides a library and a set of utilities for efficiently loading phenotype and genotype data from the UK Biobank (UKB).
Features include:
- Loading quantitative and categorical phenotypes, including self-reported phenotypes and phenotypes based on ICD-10 disease codes.
- Fast parallelized loading that leverages chunked and compressed Zarr arrays.
- Utilities for splitting the dataset samples randomly, or based on a predefined structure.
Before any data can be loaded, the UKB dataset must be converted into the Zarr format with the desired train/validation/test split. Use the provided conversion script for this step.
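To illustrate what splitting samples "randomly, or based on a predefined structure" can look like, here is a minimal sketch using only the Python standard library. It is not the repository's conversion script or its API: the function names `random_split` and `predefined_split`, the 80/10/10 fractions, and the integer sample IDs are all hypothetical and chosen only for illustration.

```python
import random


def random_split(sample_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sample IDs and cut them into train/validation/test lists.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = list(sample_ids)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        # Any samples left over after rounding end up in the test split.
        "test": ids[n_train + n_val:],
    }


def predefined_split(sample_ids, assignment):
    """Group sample IDs using a predefined mapping of sample ID to split name."""
    splits = {"train": [], "validation": [], "test": []}
    for sid in sample_ids:
        splits[assignment[sid]].append(sid)
    return splits


# Made-up sample IDs, not real UKB identifiers.
example_ids = list(range(1000, 1100))
splits = random_split(example_ids, seed=42)
print({name: len(ids) for name, ids in splits.items()})
```

Fixing the seed keeps the split stable between conversions, which matters if the Zarr store is ever regenerated and results need to stay comparable.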
For examples of loading various types of phenotypes, see this example notebook.