Tool for phasing genomic graph data using parental or proximity ligation data.
Recent Updates:
- Proximity linkage phasing working consistently on Shasta and Verkko graphs with low and high bubble N50s
- PoreC tested and working
- WDL is online and can be used to automate the alignment + phasing steps! See below
Check out the repository here:
https://github.com/meredith705/gfase_wdl
Inputs are (1) assembly GFA, (2) read FASTA, (3) parameters (see below for min_mapq
, skip-unzip
, etc). Steps are dockerized and automated for ease of use.
If you cannot run a WDL or you prefer to perform the alignment step manually, use these directions. You will also need whichever aligner is most appropriate for your data type. We have used BWA-mem for Hi-C (illumina) or Minimap2 for Pore-C (ONT)
Tested on Ubuntu 20.04 and 22.04
git clone https://github.com/rlorigro/GFAse.git
cd GFAse
mkdir build
cd build
cmake ..
make -j [n_threads]
- cmake 3.0+
- c++17
- gcc
- OpenMP
- zlib1g-dev
- libbz2-dev
- libcurl4-openssl-dev
- libjansson-dev
App description
Usage: ./phase_contacts_with_monte_carlo [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--input TEXT REQUIRED Path to BAM containing proximity linked reads. MUST be grouped by read name. Does not need index or regional sorting.
-g,--gfa TEXT Path to GFA containing assembly graph to be phased
-o,--output_dir TEXT REQUIRED
Path to (nonexistent) directory where output will be stored
-m,--min_mapq INT (Default = 1) Minimum required mapq value for mapping to be counted.
-c,--core_iterations UINT (Default = 200) Number of iterations to use for each shallow convergence in the sampling process. The final phasing round uses 3*core_iterations.
-s,--sample_size UINT (Default = 30) How many shallowly converged phase states to sample from. This is also the maximum usable concurrency (n_threads) for this stage of the pipeline.
-r,--n_rounds UINT (Default = 2) How many rounds to sample and merge.
-t,--threads UINT (Default = 1) Maximum number of threads to use.
--use_homology (Default = 0) Use sequence homology to find alts. For whenever the GFA does not have Shasta node labels.
--skip_unzip (Default = 0) After phasing nodes in the graph, DON'T unzip/concatenate haplotypes before writing to fasta. Unzipping should be skipped when using overlapped GFAs because no stitching is performed.
bwa index Assembly-Phased.fasta \
&& \
bwa mem -t 46 -5 -S -P \
Assembly-Phased.fasta \
HG002.HiC_1_S1_R1_001.fastq \
HG002.HiC_1_S1_R2_001.fastq \
| samtools sort -n -@ 24 - -o hic_to_assembly.sorted_by_read.bam \
Once you have the BAM, pass it as an argument to GFAse, along with the GFA. Note that samtools and BWA are competing for threads.
x = Time (min)
This plot shows the alignment of 3 pairs of HiC sequencing libraries.
minimap2 \
-a \
-x map-ont \
-k 17 \
-t 56 \
-K 10g \
-I 8g \
Assembly-Phased.fasta \
porec_reads.fastq.gz \
| samtools view -bh -@ 8 -q 1 - \
> porec_vs_shasta_q1.bam
PoreC does not used paired files. It is strongly recommended that you filter the mapq with a pipe to avoid massive output BAMs. No sorting is needed under the assumption that minimap2 groups output by read name. Once you have the BAM, pass it as an argument to GFAse, along with the GFA.
x = Time (min)
This plot shows the alignment of 2 flowcells of PoreC.
/home/ubuntu/software/GFAse/build/phase_contacts_with_monte_carlo \
-i hic_to_assembly.sorted_by_read.bam \
-g Assembly-Phased.gfa \
-o /path/to/output/directory/ \
-m 1 \
-t 62
Run time depends on the size of the BAM and the number of nodes in the graph. Filtering by map quality first (q>0) will save loading time.
[Resource usage plots to be added]
/home/ubuntu/software/GFAse/build/phase_contacts_with_monte_carlo \
-i hic_to_assembly.sorted_by_read.bam \
-g Assembly-Phased.gfa \
-o /path/to/output/directory/ \
--use_homology \
--skip_unzip \
-m 3 \
-t 62
Run time depends on the size of the BAM and the number of nodes in the graph, however, additional run time and memory usage is incurred as a result of needing to rediscover alts/homologs in the GFA (with flag --use_homology
). ~128GB and 48 threads should be sufficient.
Note: Verkko should be homopolymer decompressed before running alignments and GFAse. Verkko produces overlapped GFAs which are not stitchable by GFAse, so --skip_unzip
should be used.
x = Time (min)
In general, this depends on your coverage, the assembly quality, and your sequencing type. A rough guide is available below:
Pore-C Hi-C
Verkko assembly: 3 3
Shasta R9 assembly: 3 1
Shasta R10 assembly: 3 1
You may consider lowering your cutoff if you have low coverage in your mappings. This can be checked in the contacts.csv
output file. There is also an executable in GFAse to measure the total distribution of mapQ for any BAM or SAM.
A list of unique parental (separated into maternal/paternal fastas) kmers is required as input.
/home/ubuntu/software/GFAse/build/phase_haplotype_paths \
-i /home/ubuntu/data/human/hg002/assembly/Assembly-Detailed.gfa \
-p /home/ubuntu/data/human/kmer/hg03.48_55.unique.k31.fa \
-m /home/ubuntu/data/human/kmer/hg04.54_61.unique.k31.fa \
-k 31
The plot below shows phase assignment of bubbles w.r.t. a diploid reference alignment.
A recent ultra-long nanopore Shasta assembly with high phase-block contiguity after k-mer phasing (NG50 = 33.1Mbp)
Three examples of shifts in NGx in Shasta assemblies after applying GFAse with k-mers. Note that initial phased contiguity in Shasta is not necessarily a predictor of phased contiguity after GFAse.
Lorig-Roach, Ryan, Melissa Meredith, Jean Monlong, Miten Jain, Hugh Olsen, Brandy McNulty, David Porubsky, et al. 2023. “Phased Nanopore Assembly with Shasta and Modular Graph Phasing with GFAse.” bioRxiv. https://doi.org/10.1101/2023.02.21.529152.