This readme is intended only for expert developers working on the package. It describes the general folder structure.
The hapROH software package contains two primary modules and a range of functions to visualize the results.

#. hapROH identifies runs of homozygosity (ROH) in ancient and present-day DNA by using a panel of reference haplotypes. As input, it uses genotype files in Eigenstrat format. This module contains functions and wrappers to call ROH, as well as functions for downstream analysis and visualization. For backward compatibility, this software uses `hapsburg` as module name.

#. hapCON estimates contamination in aDNA data of male individuals by using a panel of reference haplotypes. It works directly from a BAM file or from samtools mpileup output.
Software to call ROH from ancient and present-day DNA using reference haplotypes. Author: Harald Ringbauer, September 2020
hapCon is an extension of hapROH for estimating the contamination rate of male aDNA samples. Author: Yilei Huang, Dec 2021
Joint estimation of ROH blocks and Contamination Rate. Author: Yilei Huang, Dec 2021
The code behind the hapROH package can be found in `./package/`.
Large data is not shared via GitHub (the `./Data` folder is git-ignored). You will need to get access to that data, e.g. by downloading the 1000 Genomes data directly and running the notebooks that process it into the required HDF5 format (see below). Simulated data can be generated by running the simulation notebooks described below.
These scripts prepare the reference data.

For the 1240k reference data, the notebook prepare_1000genomes.ipynb downloads the 1000 Genomes data to the local machine, prepares downsampled and clean 1240k data (subsampled to biallelic SNPs), and saves it to HDF5.
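To sanity-check the resulting file, the HDF5 can be inspected directly with h5py. This is only a sketch: the field names assume a scikit-allel style layout (calldata/GT, variants/POS, samples) and the path is a placeholder::

    import h5py

    # Inspect a downsampled reference HDF5 (placeholder path).
    with h5py.File("./Data/1000G_1240k/chr3.hdf5", "r") as f:
        print(list(f.keys()))          # e.g. ['calldata', 'samples', 'variants']
        print(f["calldata/GT"].shape)  # (n_snps, n_samples, 2) genotypes
        print(f["variants/POS"][:5])   # physical positions of the first SNPs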
In order to run the code, you need to build the C extension that implements the forward/backward pass.
If you do not install hapROH via pip (which automatically installs the C extensions) but use the development version from GitHub, you have to compile the C extension yourself. Switch to the C folder and build the extension via:
cd package/hapsburg
module load gcc
module load python
cythonize -a -i cfunc.pyx
This compiles the C extension into package/hapsburg. Importing `hapsb_ind` via `from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind` then uses these compiled functions.
Tested with gcc/6.1 and python/anaconda-2020.02.
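A quick way to verify that the compiled extension is picked up is a minimal import check (a sketch; it assumes hapROH is installed or ./package is on the Python path)::

    # The wrapper below relies on the compiled C functions; the import fails
    # if the Cython build above did not succeed.
    from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind
    print(hapsb_ind)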
The `./Data` folder is git-ignored; that is where the data goes. It contains:

- One-column lists of individuals to use
- The 1240k Eigenstrat .snp file, which specifies the genomic and map position of every SNP
- The input VCFs, downloaded from 1000 Genomes
- The processed VCFs, downsampled to the 1240k SNPs
- The full HDF5s (used temporarily to create the downsampled HDF5s)
- The final product: HDF5s downsampled to 1240k biallelic SNPs
- The Eigenstrat data from the Reich website; use the script in ./Notebooks/PrepareData/prepare_Reich_Eigenstrat to download and unzip it
Assumes that the cluster uses a slurm job scheduling system.
First run:
sbatch ./jupyter.sbatch
and then collect the URL from the .err file of the job.
Afterwards, do not forget to scancel the job; otherwise it runs until the time limit is hit.
This allows one to run memory- or CPU-intensive tasks via a notebook on the cluster.
For debugging/info: Show job account information for a specific job: sacct -j jobid --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
Notebooks to run hapROH in parallel on simulated as well as empirical data are found in
./Notebooks/ParallelRuns/
These run hapROH in a parallelized fashion, assuming that the notebook is run on a cluster with enough nodes (the number can be set). The general strategy is to produce lists of parameter files (one list of parameters per run, e.g. which individual and chromosome), which are then spread out to nodes via `pool.starmap` in a multirun function (see the sketch below).
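A minimal sketch of that pattern, with a hypothetical run_one function standing in for the actual hapROH call::

    import multiprocessing as mp

    def run_one(iid, ch):
        """Placeholder for a single hapROH run on one individual and chromosome."""
        print(f"Running {iid} chr{ch}")

    def multirun(fun, prms, processes=4):
        """Spread the parameter lists over the available processes."""
        with mp.Pool(processes=processes) as pool:
            pool.starmap(fun, prms)

    if __name__ == "__main__":
        # One parameter tuple per (individual, chromosome) run:
        prms = [(iid, ch) for iid in ["iid1", "iid2"] for ch in range(1, 23)]
        multirun(run_one, prms, processes=8)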
Code can be found in ./cluster_runs/
Usually the submission scripts are written as array jobs: each folder contains a Python file that runs a single individual, taking an integer as input. The scripts map this integer to individuals using the corresponding meta files (somewhere in ./data/); see the sketch after this block. The second file in each folder is slurm_batch.sh, the shell submission script. Modify the .sh scripts to specify which batch to run (i.e. how many and which individuals to submit).
They are submitted with:
sbatch slurm_batch.sh
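A sketch of such a per-folder runner script (paths, column names, and the commented-out call are placeholder assumptions)::

    import sys
    import pandas as pd
    from hapsburg.PackagesSupport.hapsburg_run import hapsb_ind

    if __name__ == "__main__":
        i = int(sys.argv[1])                      # array index passed by slurm
        meta = pd.read_csv("./Data/meta_df.csv")  # placeholder meta file
        iid = meta["iid"].values[i]               # map integer -> individual
        print(f"Running individual {iid}")
        # hapsb_ind(iid=iid, ...)                 # actual call, parameters as in the package docs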
Usually the analysis is run per chromosome, and results are saved into folders: basefolder/iid/chr0/
There is a helper function combine_individual_data from ./packagesSupport/parallel_runs/helper_functions that combines these per-chromosome results into one .csv per individual, with an option to delete the chromosome result folders to save space. For detailed downstream analysis (e.g. of the posterior), the delete argument has to be False to keep the posterior files. The standard output is base_folder/IID_roh_full.csv.
These jobs produce summary files for each individual (usually iid_roh_full.csv) with the fields (Start, End, StartM, EndM, length, lengthM, iid, ch).
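A plain-pandas sketch of what this combining step does (this is not the helper's API; paths and the per-chromosome file name are placeholder assumptions)::

    import glob
    import pandas as pd

    def combine_chromosomes(base_folder, iid):
        """Concatenate per-chromosome ROH tables into one csv per individual."""
        paths = sorted(glob.glob(f"{base_folder}/{iid}/chr*/roh.csv"))
        df = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
        # Fields as listed above: Start, End, StartM, EndM, length, lengthM, iid, ch
        df.to_csv(f"{base_folder}/{iid}_roh_full.csv", index=False)
        return df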
These summary files are then further post-processed with functions in ./packagesSupport/pp_individual_roh_csvs.py. The usual structure of doing this is:
- Produce a list of paths to the .csvs
- Iterate over these csvs; extract and post-process (e.g. gap-merge) them
- Merge the results into one master .csv file, saved as combined_roh.csv, with columns (iid, pop, max_roh, sum_roh, n_roh, lat, lon, age), where some summary details can be merged back in from the master meta dataframe (see the sketch below). The function to do so can be found in pp_individual_roh_csvs.py in PackagesSupport and is typically called towards the end of the ParallelRuns notebooks.
One example of this workflow: ./Notebooks/PrepareData/roh_process.ipynb
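A plain-pandas sketch of this post-processing, without the package helpers (the paths and the 4 cM length cutoff are illustrative assumptions; gap-merging is omitted)::

    import glob
    import pandas as pd

    rows = []
    for path in glob.glob("./Output/*_roh_full.csv"):  # 1) list of per-individual csvs
        df = pd.read_csv(path)                         # 2) load and post-process
        df = df[df["lengthM"] > 0.04]                  #    e.g. keep ROH longer than 4 cM
        if len(df) == 0:
            continue
        rows.append({"iid": df["iid"].iloc[0],
                     "max_roh": df["lengthM"].max() * 100,  # in cM
                     "sum_roh": df["lengthM"].sum() * 100,
                     "n_roh": len(df)})

    df_combined = pd.DataFrame(rows)                   # 3) one master .csv
    df_combined.to_csv("./Output/combined_roh.csv", index=False)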
These combined csvs are then combined again in 'combine_roh_results.ipynb' into one giant master dataframe. In this notebook, "region" columns are also defined, which are the basis for downstream processing.
There is a notebook in Notebooks/PrepareData that prepares the meta file, as well as the HDF5, in the right format.
All individuals can be sbatched via a script in packagesSupport/cluster_runs/Antonio_callROH/
Single individuals can be rerun with ./Notebooks/ParallelRuns/parallel_antonio19.ipynb. This file also contains the code to combine all individual outputs into one summary .csv, which can then be used for downstream analysis (such as plotting all individuals).
Notebooks that plot the data can be found in ./Notebooks/figures/. In particular, posterior plotting is found in plot_posetrior.ipynb. Geographic and temporal figure production is found in plot_map_ROH.ipynb.
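As an illustration only (not the figure code in these notebooks), a minimal geographic summary plot from combined_roh.csv could look like this, assuming the columns listed above (lat, lon, sum_roh)::

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("./Output/combined_roh.csv")  # placeholder path
    plt.scatter(df["lon"], df["lat"], c=df["sum_roh"], s=15, cmap="viridis")
    plt.colorbar(label="sum_roh")
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.savefig("map_roh.png", dpi=200)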
For pseudo-haploid 1240k data: 2-10 GB per chromosome (during internal calculations), depending on target coverage.
Total runtime on a single CPU (Intel Xeon E5-2680 v4, 2.40 GHz) for one individual (22 autosomes): up to 15 minutes.
Notebooks that wrap shell commands for these two tools are found in ./Notebooks/PLINK.
General strategy: analyze the mosaic individuals with copied-in ROH blocks. For that, run the tools on the datasets with 100 individuals.
First, transfer the H5 to VCF (for BCFtools, the PL genotype likelihood field is also needed); tools for that conversion are found in ./PackagesSupport/h5_python/h5_functions.py. After running the tools on the VCF (saving into output folders, after transforming to the rough "HAPSBURG" ROH format), split up the output .csv (or dataframe) into the individual output folders (see the sketch below). These can then be analyzed in the same way as HAPSBURG outputs, with the tools in ./Notebooks/Mosaic1000G_Analysis.
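A sketch of that splitting step (the input path and output layout are placeholder assumptions; the "iid" column follows the field list above)::

    import os
    import pandas as pd

    df = pd.read_csv("./Output/plink_roh_all.csv")  # combined PLINK/BCFtools results (placeholder)
    for iid, df_iid in df.groupby("iid"):
        folder = os.path.join("./Output_plink", str(iid))
        os.makedirs(folder, exist_ok=True)
        df_iid.to_csv(os.path.join(folder, "roh.csv"), index=False)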