Skip to content

Latest commit

 

History

History
321 lines (257 loc) · 16.1 KB

0.Preparation.md

File metadata and controls

321 lines (257 loc) · 16.1 KB

Preparation

Variables:

DIR_BASE=/lizardfs/guarracino/pggb-paper

WFMASH=/home/guarracino/tools/wfmash/build/bin/wfmash-0b191bb84ffdfd257354c1aa82a7f1e13dc536d0

Data

A. thaliana

82 assemblies in total: on 2023/11/30, specifying Status: Latests and Exclude: Exclude partial (txid3702[Organism] AND (latest[filter] AND all[filter] NOT partial[filter])), we downloaded 111 assemblies from GenBank (https://www.ncbi.nlm.nih.gov/assembly). We added GCA_028009825.1, as GCA_028009825.2 added two 45S rDNA for chr2 and chr4 in October, which is before we run DeepVariant. We kept those with at most 7 contigs (5 or 7), removing contigs representing mitochondria and plasmids in the only 3 assemblies that featured them.

Citations:

  • Wlodzimierz et al. Cycles of satellite and transposon evolution in Arabidopsis centromeres. Nature 618, 557–565 (2023). https://doi.org/10.1038/s41586-023-06062-z
  • Hou et al. A near-complete assembly of an Arabidopsis thaliana genome. Mol Plant. 2022 Aug 1;15(8):1247-1250. doi: 10.1016/j.molp.2022.05.014. Epub 2022 Jun 1. PMID: 35655433.
  • Matthew et al. ,The genetic and epigenetic landscape of the Arabidopsis centromeres.Science374,eabi7489(2021).DOI:10.1126/science.abi7489
  • Jaegle et al. Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity. Genome Biol 24, 44 (2023). https://doi.org/10.1186/s13059-023-02875-3
cd  $DIR_BASE/assemblies/athaliana

# Partition
sbatch -c $THREADS -p allnodes -J athaliana82-vs-ref --wrap "hostname; cd /scratch; \time -v $WFMASH $DIR_BASE/assemblies/athaliana/GCA_028009825.1.fasta.gz $DIR_BASE/assemblies/athaliana/athaliana82.fa.gz -m > athaliana82.vs.GCA_028009825-1.mapping.paf -t 48 -N; mv athaliana82.vs.GCA_028009825-1.mapping.paf $DIR_BASE/assemblies/athaliana/"

# Chromosomes in GCA_028009825.1.fasta.gz are in order (from 1 to 5)
c=1
cut -f 1 GCA_028009825.1.fasta.gz.fai | while read CONTIG; do
  echo chr$c
  \time -v samtools faidx athaliana82.fa.gz $(grep "$CONTIG" -w athaliana82.vs.GCA_028009825-1.mapping.paf | cut -f 1) | bgzip -@ 48 -l 9 > athaliana82.chr$c.fa.gz && samtools faidx athaliana82.chr$c.fa.gz
  ((c++))
done 2>&1 | tee athaliana82.split-by-chr.log

E. coli

4104 assemblies in total: on 2023/11/30, specifying Status: Latests, Assembly level: Complete genome and Exclude: Exclude partial (txid562[Organism] AND (latest[filter] AND "complete genome"[filter] AND all[filter] NOT partial[filter])), we downloaded 4104 assemblies from GenBank (https://www.ncbi.nlm.nih.gov/assembly). From these, we randomly sampled 500 and 50 assemblies.

Citation:

  • Eric et al., Database resources of the national center for biotechnology information, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D20–D26, https://doi.org/10.1093/nar/gkab1112
# Run until GCA_000597845.1 is present
samtools faidx ecoli4104.fa.gz $(grep -f <(cut -f 1 ecoli4104.fa.gz.fai -d '#' | sort | uniq | shuf | head -n 500) ecoli4104.fa.gz.fai | cut -f 1) | bgzip -@ 48 -l 9 > ecoli500.fa.gz && samtools faidx ecoli500.fa.gz
# Run until GCA_000597845.1 is present
samtools faidx ecoli500.fa.gz $(grep -f <(cut -f 1 ecoli500.fa.gz.fai -d '#' | sort | uniq | shuf | head -n 50) ecoli500.fa.gz.fai | cut -f 1) | bgzip -@ 48 -l 9 > ecoli50.fa.gz && samtools faidx ecoli50.fa.gz

H. sapiens

47 diploid assemblies (88 haplotypes) from the HPRC + GRCh38 and CHM13 reference genomes.

Citation:

  • Liao, WW., Asri, M., Ebler, J. et al. A draft human pangenome reference. Nature 617, 312–324 (2023). https://doi.org/10.1038/s41586-023-05896-x
  • Sergey et al. The complete sequence of a human genome.Science376,44-53(2022). DOI:10.1126/science.abj6987

Mouse

36 assemblies in total: on 2023/11/30, specifying Status: Latests, Assembly level: Chromosome and Exclude: Exclude anomalous (txid10090[Organism] AND (latest[filter] AND "chromosome level"[filter] AND all[filter] NOT anomalous[filter])), we downloaded 36 assemblies from GenBank (https://www.ncbi.nlm.nih.gov/assembly).

ls *fna | while read FASTA; do ACC=$(basename $FASTA .fna | cut -f 1,2 -d '_'); echo $ACC; ./rename-mice-fasta.sh $FASTA $ACC | bgzip -l 9 -@ 48 > $ACC.fasta.gz; samtools faidx $ACC.fasta.gz; done

ls *fasta.gz | while read f; do echo $f; samtools faidx $f $(grep chr19 $f.fai | cut -f 1) >> mmusculus36.chr19.fa; done
bgzip -@ 48 -l 9 mmusculus36.chr19.fa && samtools faidx mmusculus36.chr19.fa.gz

Primates

Release 2023/12/05

# https://genomeark.s3.amazonaws.com/index.html?prefix=species/
mkdir -p $DIR_BASE/assemblies/primates
cd $DIR_BASE/assemblies/primates
wget -c https://s3.amazonaws.com/genomeark/species/Gorilla_gorilla/mGorGor1/assembly_curated/mGorGor1.dip.cur.20231122.fasta.gz
wget -c https://s3.amazonaws.com/genomeark/species/Pan_paniscus/mPanPan1/assembly_curated/mPanPan1.dip.cur.20231122.fasta.gz
wget -c https://s3.amazonaws.com/genomeark/species/Pan_troglodytes/mPanTro3/assembly_curated/mPanTro3.dip.cur.20231122.fasta.gz
wget -c https://s3.amazonaws.com/genomeark/species/Pongo_abelii/mPonAbe1/assembly_curated/mPonAbe1.dip.cur.20231205.fasta.gz
wget -c https://s3.amazonaws.com/genomeark/species/Pongo_pygmaeus/mPonPyg2/assembly_curated/mPonPyg2.dip.cur.20231122.fasta.gz
wget -c https://genomeark.s3.amazonaws.com/species/Symphalangus_syndactylus/mSymSyn1/assembly_curated/mSymSyn1.dip.cur.20231205.fasta.gz

# Apply PanSN-spec
for species in mGorGor1 mPanPan1; do
    zcat $species.dip.cur.20231122.fasta.gz | \
        sed "/^>/ s/chr\(.*\)_\(mat\|pat\)_*/$species##\2##\0/; s/#mat#/M/; s/#pat#/P/" | \
        bgzip -@ 48 -l 9 > $species.fa.gz && samtools faidx $species.fa.gz
done
for species in mPanTro3 mPonPyg2; do
    zcat $species.dip.cur.20231122.fasta.gz  | \
        sed "/^>/ s/chr\(.*\)_\(hap1\|hap2\)_*/$species##\2##\0/; s/#hap1#/1/; s/#hap2#/2/" | \
        bgzip -@ 48 -l 9 > $species.fa.gz && samtools faidx $species.fa.gz
done
for species in mPonAbe1 mSymSyn1; do
    zcat $species.dip.cur.20231205.fasta.gz  | \
        sed "/^>/ s/chr\(.*\)_\(hap1\|hap2\)_*/$species##\2##\0/; s/#hap1#/1/; s/#hap2#/2/" | \
        bgzip -@ 48 -l 9 > $species.fa.gz && samtools faidx $species.fa.gz
done
rm *fasta.gz

wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.0.1.fasta.gz
zcat hg002v1.0.1.fasta | sed -e 's/^>chr\(.*\)_MATERNAL/>hg002#M#chr\1_MATERNAL/' \
    -e 's/^>chr\(.*\)_PATERNAL/>hg002#P#chr\1_PATERNAL/' \
    -e 's/^>chrEBV/>hg002#P#chrEBV/' \
    -e 's/^>chrM/>hg002#M#chrM/' | bgzip -@ 48 -l 9 > hg002v101.fa.gz && samtools faidx hg002v101.fa.gz
rm hg002v1.0.1.fasta.gz

zcat chm13v2.0.fa.gz grch38.fa.gz hg002v101.fa.gz mGorGor1.fa.gz mPanPan1.fa.gz mPanTro3.fa.gz mPonAbe1.fa.gz mPonPyg2.fa.gz mSymSyn1.fa.gz | bgzip -@ 48 -l 9 > primates16.20231205.fa.gz && samtools faidx primates16.20231205.fa.gz


# grch38#1#chr6	29722775	33089696  MHC
$WFMASH chm13v2.0.fa.gz mSymSyn1.fa.gz -m -t 48 -N > mSymSyn1.vs.chm13v2.mappings.N.paf
  grep 'chm13#1#chr6' mSymSyn1.vs.chm13v2.mappings.N.paf
  mSymSyn1#1#chr23_hap1   78312223        0       78312223        -       chm13#1#chr6    172126628       19925566        98237789        87      78312223        14      id:f:0.959062   kc:f:0.576732
  mSymSyn1#2#chr23_hap2   71145118        0       71145118        -       chm13#1#chr6    172126628       24314822        95459940        90      71145118        14      id:f:0.960383   kc:f:0.634831
  ...

cat \
  <(samtools faidx chm13v2.0.fa.gz chm13#1#chr6) \
  <(samtools faidx grch38.fa.gz grch38#1#chr6) \
  <(samtools faidx hg002v101.fa.gz hg002#M#chr6_MATERNAL hg002#P#chr6_PATERNAL) \
  <(samtools faidx mGorGor1.fa.gz mGorGor1#M#chr5_mat_hsa6 mGorGor1#P#chr5_pat_hsa6) \
  <(samtools faidx mPanPan1.fa.gz mPanPan1#M#chr5_mat_hsa6 mPanPan1#P#chr5_pat_hsa6) \
  <(samtools faidx mPanTro3.fa.gz mPanTro3#1#chr5_hap1_hsa6 mPanTro3#2#chr5_hap2_hsa6) \
  <(samtools faidx mPonAbe1.fa.gz mPonAbe1#1#chr5_hap1_hsa6 mPonAbe1#2#chr5_hap2_hsa6) \
  <(samtools faidx mPonPyg2.fa.gz mPonPyg2#1#chr5_hap1_hsa6 mPonPyg2#2#chr5_hap2_hsa6) \
  <(samtools faidx mSymSyn1.fa.gz mSymSyn1#1#chr23_hap1 mSymSyn1#2#chr23_hap2) | \
  bgzip -@ 48 -l 9 > primates16.hsa6.fa.gz && samtools faidx primates16.hsa6.fa.gz

S. cerevisiae

Download 142 assemblies from:

  • O’Donnell et al. Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae. Nat Genet 55, 1390–1399 (2023). https://doi.org/10.1038/s41588-023-01459-y

Soy

Download 38 assemblies from:

  • Liu et al. Pan-Genome of Wild and Cultivated Soybeans. Cell. 2020 Jul 9;182(1):162-176.e13. doi: 10.1016/j.cell.2020.05.023
  • Chu et al. Eight soybean reference genome resources from varying latitudes and agronomic traits. Sci Data. 2021;8,164. doi: 10.1038/s41597-021-00947-2
cut -f 3 $DIR_BASE/data/soy37.urls.txt | parallel -j 4 'wget -q {} && echo got {}'

rm soy37.chr*.fa*
for i in WM82 ZH13 C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11 C12 C13 C14 L01 L02 L03 L04 L05 L06 L07 L08 L09 W01 W02 W03 W05 GmWF7 GmHF25 GmZH35 GmZH13 GmJY GmHX3 GmW82 GsojaF; do 
    echo $i

    file=$(grep $i -w $DIR_BASE/data/soy37.urls.txt | cut -f 3 | rev | cut -f 1 -d '/' | rev)
    for c in {1..20}; do
        start=`zgrep -n "^>" $file | head -n$c | tail -n1 | cut -d: -f1`
        end=`zgrep -n "^>" $file | head -n$((c+1)) | tail -n1 | cut -d: -f1`
        zcat $file | \
            head -n $((end-1)) | \
            tail -n $((end-start)) | \
            sed "s/^>.*$/>$i#1#chr$c/" >> /scratch/soy37.chr$c.fa
    done
done
rm *a.gz
for c in {1..20}; do
    echo chr$c

    bgzip -@ 48 -l 9 /scratch/soy37.chr$c.fa
    samtools faidx /scratch/soy37.chr$c.fa.gz
done

mv /scratch/soy37.chr*.fa* $DIR_BASE/assemblies/soy

cd $DIR_BASE/assemblies/soy
seq 1 20 | while read f; do echo $f; zcat soy37.chr$f.fa.gz >> soy37.fa; done
bgzip -@ 48 -l 9 soy37.fa && samtools faidx soy37.fa.gz

Tomato

Download 23 assemblies from:

cut -f 2 $DIR_BASE/data/tomato23.urls.tsv | parallel -j 4 'wget -q {} && echo got {}'

ls *fasta.gz | while read f; do
  NAME=$(basename $f .fasta.gz | cut -f 1 -d '.');
  echo $NAME;
  zcat $f | sed "s/^>/>$NAME#1#chr/g" | bgzip -@ 48 -l 9 > $NAME.fa.gz;
  samtools faidx $NAME.fa.gz
done
rm *fasta.gz

seq 1 12 | while read i; do
  echo chr$i
  rm tomato23.chr$i.fa
  ls *fa.gz | grep chr -v | while read FASTA; do
    echo $FASTA
    samtools faidx $FASTA $(cut -f 1 $FASTA.fai | grep chr$i -w) >> tomato23.chr$i.fa
  done
  bgzip -@ 48 -l 9 tomato23.chr$i.fa && samtools faidx tomato23.chr$i.fa.gz
done

seq 1 12 | while read f; do echo $f; zcat tomato23.chr$f.fa.gz >> tomato23.fa; done
bgzip -@ 48 -l 9 tomato23.fa && samtools faidx tomato23.fa.gz

Tools

mkdir -p ~/tools $$ cd ~/tools

git clone --recursive https://github.com/waveygang/wfmash
cd wfmash
git checkout master && git pull && git submodule update --init --recursive
git checkout 0b191bb84ffdfd257354c1aa82a7f1e13dc536d0
cmake -H. -Bbuild && cmake --build build -- -j 48
cp build/bin/wfmash build/bin/wfmash-0b191bb84ffdfd257354c1aa82a7f1e13dc536d0
cd ..

git clone --recursive https://github.com/ekg/seqwish.git
cd seqwish
git checkout master && git pull && git submodule update --init --recursive
git checkout f44b402f0c2e02988d431d9b2e5eba9727cf93a9
rm -rf build/
cmake -H. -Bbuild && cmake --build build -- -j 48
cp bin/seqwish bin/seqwish-f44b402f0c2e02988d431d9b2e5eba9727cf93a9
cd ..

git clone --recursive https://github.com/pangenome/smoothxg.git
cd smoothxg
git checkout master && git pull && git submodule update --init --recursive
git checkout ae949a1053fb3d7c7fb7bdf358aefcbcbd8073a4
rm -rf build/
cmake -H. -Bbuild && cmake --build build -- -j 48
cp bin/smoothxg bin/smoothxg-ae949a1053fb3d7c7fb7bdf358aefcbcbd8073a4
cd ..

git clone --recursive https://github.com/pangenome/odgi.git
cd odgi
git checkout master && git pull && git submodule update --init --recursive
git checkout 861b1c04f5622c5bb916a161c1abe812c213f1a5
rm -rf build/
cmake -H. -Bbuild && cmake --build build -- -j 48
cp bin/odgi bin/odgi-861b1c04f5622c5bb916a161c1abe812c213f1a5
cd ..

git clone https://github.com/marschall-lab/GFAffix.git
cd GFAffix
git checkout main && git pull && git submodule update --init --recursive
git checkout d630eb7d9827340f5f292e57cb3cb5e31e6f86f0
env -i bash -c 'PATH=:/usr/local/bin:/usr/bin:/bin ~/.cargo/bin/cargo build --release'
cp target/release/gfaffix target/release/gfaffix-d630eb7d9827340f5f292e57cb3cb5e31e6f86f0
cd ..

git clone --recursive https://github.com/pangenome/pggb.git
cd pggb
git checkout master && git pull && git submodule update --init --recursive
git checkout 13482bd06359a7ad8e3d3e0dd6eb6d9399f26046
cp pggb pggb-x
sed 's,"$fmt" wfmash,"$fmt" ~/tools/wfmash/build/bin/wfmash-0b191bb84ffdfd257354c1aa82a7f1e13dc536d0,g' pggb-x -i
sed 's,$(wfmash,$(~/tools/wfmash/build/bin/wfmash-0b191bb84ffdfd257354c1aa82a7f1e13dc536d0,g' pggb-x -i
sed 's,"$fmt" seqwish,"$fmt" ~/tools/seqwish/bin/seqwish-f44b402f0c2e02988d431d9b2e5eba9727cf93a9,g' pggb-x -i
sed 's,"$fmt" smoothxg,"$fmt" ~/tools/smoothxg/bin/smoothxg-ae949a1053fb3d7c7fb7bdf358aefcbcbd8073a4,g' pggb-x -i
sed 's,"$fmt" odgi,"$fmt" ~/tools/odgi/bin/odgi-861b1c04f5622c5bb916a161c1abe812c213f1a5,g' pggb-x -i
sed 's,"$fmt" gfaffix,"$fmt" ~/tools/GFAffix/target/release/gfaffix-d630eb7d9827340f5f292e57cb3cb5e31e6f86f0,g' pggb-x -i
mv pggb-x pggb-13482bd06359a7ad8e3d3e0dd6eb6d9399f26046
cp partition-before-pggb partition-before-pggb-x
sed 's,"$fmt" wfmash,"$fmt" ~/tools/wfmash/build/bin/wfmash-0b191bb84ffdfd257354c1aa82a7f1e13dc536d0,g' partition-before-pggb-x -i
sed 's,$(wfmash,$(~/tools/wfmash/build/bin/wfmash-0b191bb84ffdfd257354c1aa82a7f1e13dc536d0,g' partition-before-pggb-x -i
sed 's,"$fmt" seqwish,"$fmt" ~/tools/seqwish/bin/seqwish-f44b402f0c2e02988d431d9b2e5eba9727cf93a9,g' partition-before-pggb-x -i
sed 's,"$fmt" smoothxg,"$fmt" ~/tools/smoothxg/bin/smoothxg-ae949a1053fb3d7c7fb7bdf358aefcbcbd8073a4,g' partition-before-pggb-x -i
sed 's,"$fmt" odgi,"$fmt" ~/tools/odgi/bin/odgi-861b1c04f5622c5bb916a161c1abe812c213f1a5,g' partition-before-pggb-x -i
sed 's,"$fmt" gfaffix,"$fmt" ~/tools/GFAffix/target/release/gfaffix-d630eb7d9827340f5f292e57cb3cb5e31e6f86f0,g' partition-before-pggb-x -i
mv partition-before-pggb-x partition-before-pggb-13482bd06359a7ad8e3d3e0dd6eb6d9399f26046
cd ..

wget -c https://github.com/vgteam/vg/releases/download/v1.53.0/vg
chmod +x vg

git clone --recursice https://github.com/pangenome/vcfbub.git
cd vcfbub
git checkout 26a1f0cb216a423f8547c4ad0e0ce38cb9d324b9 # version 0.1.0
cargo build --release

/gnu/store/79r54wk4p3705dk89jg9hidyvf4754jp-vcflib-1.0.3+fdcdaad-10/bin/vcfwave

# Minigraph-Cactus
cd /lizardfs/guarracino/tools
wget -c https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.7.0/cactus-bin-v2.7.0.tar.gz
tar -xzf cactus-bin-v2.7.0.tar.gz && rm cactus-bin-v2.7.0.tar.gz
cd cactus-bin-v2.7.0

# guix install python-virtualenv
virtualenv -p python3 venv-cactus-v2.7.0
printf "export PATH=$(pwd)/bin:\$PATH\nexport PYTHONPATH=$(pwd)/lib:\$PYTHONPATH\n" >> venv-cactus-v2.7.0/bin/activate
source venv-cactus-v2.7.0/bin/activate
python3 -m pip install -U setuptools pip wheel
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt

cactus ./jobstore ./examples/evolverMammals.txt ./evolverMammals.hal --realTimeLogging # test

# Panacus
git clone https://github.com/marschall-lab/panacus.git
cd panacus
git checkout main && git pull && git submodule update --init --recursive
git checkout bd492f54c05367d0fc5a2c3fb9bf23260ac8379e
env -i bash -c 'PATH=:/usr/local/bin:/usr/bin:/bin ~/.cargo/bin/cargo build --release'
cp target/release/panacus target/release/panacus-bd492f54c05367d0fc5a2c3fb9bf23260ac8379e
cd ..

sudo apt-get update
sudo apt-get install libxcb-render0-dev libxcb-shape0-dev libxcb-xfixes0-dev
git clone --recursive https://github.com/chfi/gfaestus.git
cd gfaestus
cargo build --release

conda create --prefix /lizardfs/guarracino/condatools/edta/2.1.0/ -c conda-forge -c bioconda edta=2.1.0 -y
git clone --recursive https://github.com/oushujun/EDTA.git