inrae/GenomAsm4pg


Asm4pg is an automatic and reproducible genome assembly workflow designed for pangenomic applications using PacBio HiFi data.

The official Asm4pg repository can be found at the INRAE forge.
A GitHub mirror can be found at INRAE GitHub.

This workflow leverages Snakemake for efficient genome assembly and generates an HTML report summarizing key assembly statistics.

Asm4pg can assemble in the following modes:

  • HiFi mode (default)
    Performs primary genome assembly using high-fidelity long reads.

  • hi-c mode
    Uses Hi-C data to scaffold the assembled contigs into chromosome-scale scaffolds.

  • trio mode
    Uses parental short reads to partition long reads by haplotype before assembly.

  • ont mode
    Uses ultra-long reads (in fastq or bam format; fasta is not supported) to perform the assembly.

  Workflow flowchart
Animated version

📂 Repository Structure

├── README.md
├── asm4pg  # <- The running script
├── doc
├── workflow
│   ├── scripts
│   └── Snakefile
└── .config
    ├── snakemake_profile
    └── masterconfig.yaml # <- Your configuration

✅ Requirements

  • Miniforge/conda (for Snakemake>=8.4.7 and the SLURM plugin)
  • Singularity/Apptainer (for containerized execution)

Note: All external tools are automatically managed by Snakemake and will be downloaded as Singularity/Apptainer images (~6GB total).


🚀 How to Use (quick guide)

1. Set up

Clone the Git repository

git clone https://forge.inrae.fr/asm4pg/GenomAsm4pg/ && cd GenomAsm4pg && mkdir slurm_logs
  • Create an environment for Snakemake (from the provided envfile):
conda env create -n wf_env -f .config/wf_env.yaml

Use Miniforge with the conda-forge channel; see why here (in French).

  • Update the asm4pg file with the correct paths to the Singularity/Apptainer modules (lines 45-46):
nano asm4pg

You can configure this file for multiple servers using the case statement (see the example for the Genotoul HPC on line 33).
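As an illustration of the case-statement idea, here is a minimal sketch of a per-server switch. The host patterns (`cluster-a*`, `cluster-b*`), the module names, and the helper `select_container_module` are all hypothetical placeholders; the actual variable names and values in the asm4pg script may differ, so check lines 33 and 45-46 of the script itself.

```shell
# Hypothetical helper: map a host name to the container module to load.
# Host patterns and module names are placeholders, not the values shipped
# in the asm4pg script.
select_container_module() {
    case "$1" in
        cluster-a*) echo "apptainer" ;;    # assumed module name on server A
        cluster-b*) echo "singularity" ;;  # assumed module name on server B
        *)          echo "" ;;             # unknown host: nothing selected
    esac
}

# Pick the module for the current machine (uname -n prints the host name).
module_name="$(select_container_module "$(uname -n)")"
if [ -n "$module_name" ]; then
    echo "Would run: module load $module_name"
else
    echo "Unknown host: set the Singularity/Apptainer module paths manually" >&2
fi
```

Keeping one branch per server means the same script can be copied between clusters without further editing.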

2. Configure the pipeline for your data

  • Edit the masterconfig file in the .config/ directory with your sample information.
nano .config/masterconfig.yaml
  • Here you can add the path to your long reads file (fasta.gz, fasta, fastq.gz, fastq, or bam)
  • Update the path to the parent directory of the output directory
  • We advise keeping the default options for the first run.

Example config:

samples:       
  example1:              # <- First indent = Name of the assembly
    reads: example-1.fasta.gz    # <- Second indent = All options
    busco_lineage: insecta_odb10
  example2: 
    reads: example-2.fasta.gz
    busco_lineage: eudicots_odb10 # Options only affect current assembly
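Before launching, it can help to confirm that every `reads:` path in the config actually exists. The following is a rough sketch, not part of asm4pg: the helpers `list_reads` and `check_reads` are hypothetical, and the sketch assumes one `reads:` entry per line, paths without spaces, and paths that resolve from the current directory.

```shell
# Hypothetical helper: print the value of every "reads:" key in a config file.
# Inline comments after the path are ignored because only field 2 is printed.
list_reads() {
    awk '$1 == "reads:" { print $2 }' "$1"
}

# Hypothetical helper: return non-zero if any listed reads file is missing.
check_reads() {
    missing=0
    for f in $(list_reads "$1"); do
        if [ ! -e "$f" ]; then
            echo "missing reads file: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_reads .config/masterconfig.yaml` prints any missing file to stderr and returns a non-zero status if at least one is absent.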

3. Run the workflow

  • Run the workflow:
sbatch asm4pg dry # Check for warnings
sbatch asm4pg run # Then

Note: If your account name can't be automatically determined, add it in the .config/snakemake/profiles/slurm/config.yaml file.

Note: Use the command squeue --format="%.10i %.9P %.6j %.10k %.8u %.2t %.10M %.6D %.20R" -A $user to see job names.
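If you prefer to queue both steps at once, one possible approach is to chain them with a SLURM job dependency. This is only a sketch: `submit_dry_then_run` is a hypothetical helper, and it assumes the dry run exits with a non-zero status when something is wrong. It does not replace reading the dry-run log for warnings.

```shell
# Hypothetical helper: submit the dry run, then the real run only if the
# dry-run job finishes successfully (SLURM "afterok" dependency).
submit_dry_then_run() {
    dry_id="$(sbatch --parsable asm4pg dry)" || return 1
    dry_id="${dry_id%%;*}"   # --parsable may print "jobid;cluster"
    sbatch --parsable --dependency="afterok:${dry_id}" asm4pg run
}
```

Calling `submit_dry_then_run` from the repository root prints the job ID of the real run; if the dry run fails, the dependent job stays pending with an unsatisfied dependency until it is cancelled.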

⚙️ Other running options

asm4pg [dry|run|local-run|dag|rulegraph|unlock|touch] [additional snakemake args]
    dry - run in dry-run mode
    run - run the workflow with SLURM
    local-run - run the workflow locally (on a single node)
    dag - generate the directed acyclic graph for the workflow
    rulegraph - generate the rulegraph for the workflow
    unlock - Unlock the directory if snakemake crashed
    touch - Tell snakemake that all files are up to date (use with caution)
    [additional snakemake args] - for any snakemake arg, like --until hifiasm

🔧 Using the full potential of the workflow

Asm4pg has many options. To change the default values or learn more about the workflow, please refer to the documentation.

Output of the workflow:

└── sample
    └── results
        ├── 00_converted_input
        ├── 01_raw_assembly
        │   ├── sample.fasta.gz
        │   └── sample.gfa
        ├── 02_final_assembly
        │   ├── hap1/hap2 
        │   │   ├── sample.fasta.gz # <- The final assembly
        │   │   └── ragtag_scafold
        ├── 03_raw_data_qc
        │   ├── genometools
        │   ├── genomescope
        │   └── jellyfish
        ├── 04_assembly_qc
        │   ├── hap1/hap2
        │   │   ├── genometools
        │   │   ├── busco
        │   │   ├── katplot
        │   │   ├── LTR/LAI
        │   │   └── telomeres
        │   ├── merqury
        │   │   ├── ...
        │   │   └── meryl_database.meryl
        │   └── quast
        ├── final_report.html # <- The final report
        ├── benchmark
        └── logs
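Given the layout above, the final assemblies and reports for all samples can be collected with `find`. This is a small sketch: `list_final_assemblies` and `list_reports` are hypothetical helpers, and the argument is whatever parent output directory you set in the config.

```shell
# Hypothetical helper: list every final assembly under an output directory.
# Matches only files below results/02_final_assembly, not the raw assemblies.
list_final_assemblies() {
    find "$1" -type f -path '*/results/02_final_assembly/*' -name '*.fasta.gz' | sort
}

# Hypothetical helper: list the HTML report of every sample.
list_reports() {
    find "$1" -type f -path '*/results/final_report.html' | sort
}
```

For instance, `list_final_assemblies /path/to/output` prints one line per haplotype assembly, which is convenient as input to downstream pangenome tools.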

📜 How to cite asm4pg?

Until the paper is published, you can cite asm4pg as follows:

Denni S*, Piat L*, Bouallegue S, Tran J, Smith K, Wu C, Klopp C, Bui QT, Duvaux L. Asm4pg: a workflow for efficient long-read genome assembly for pangenomics (In prep.). https://forge.inrae.fr/asm4pg/GenomAsm4pg/

* These authors contributed equally to this work.

License

The content of this repository is licensed under the GNU GPLv3.

✉️ Contacts

For troubleshooting, bug reports, or feature suggestions, please use the issue tab of this repository. For any other questions, or if you want to help develop asm4pg, please contact Ludovic Duvaux at ludovic.duvaux@inrae.fr.
