NLR-Assembler is a command-line tool designed to combine contigs from de novo assembled RenSeq data into groups that represent sequence information of full-length genes, using barcodes from linked-read sequencing by 10x Genomics. The final assembly generated by NLR-Assembler is designed to be used with the MutantHunter Pipeline to aid in the discovery of novel disease resistance genes. In this repository, the documentation for NLR-Assembler tool and the associated pipeline are described. If would you would like more information on how NLR-Assembler works you can request a copy of my dissertation here.
PLEASE NOTE THAT THIS SOFTWARE IS HIGHLY EXPERIMENTAL.
Download Python (3.10) and the NLR-Assembler repository. All dependencies for NLR-Assembler are shown in the requirements.txt and can be installed using pip or anaconda. We recommend using an anaconda environment to generate the complete environment for NLR-Assembler.
The following commands can be used to setup an environment and install the requirements using anaconda and pip:
conda create -n NLR-Assembler_ENV python=3.10
source activate NLR-Assembler_ENV
pip3 install -r requirements.txt
Four commands are currently available using NLR-Assembler: index, group, contig-coverage and nlr-coverage. All can be running using the following generic command (once the conda environment has been activated):
source activate NLR-Assembler_ENV
python3 main.py <command> ...
These commands are used at specific points in the NLR-Assembler pipeline (see below).
The index command generates an index of all barcoded reads where each read is assigned a colour (RGB value) according to its barcode. Reads with the same barcode are assigned the same colour in the index. Reads with no barcode matching the whitelist provided are assigned the colour black (RGB value : 0,0,0). Sequencing errors in barcodes are automatically corrected whilst generating the output. Note: index is designed to run using multiple cores (we used 12).
The final output is a csv file named 'barcode_index.csv'.
Run the index command as follows:
python3 main.py index --fastq input.fasta --adapters whitelist.txt --cores 12 --split 1000000
parameter | argument | description |
---|---|---|
--fastq | input.fasta | The raw barcoded reads in fastq format (Forward reads (R1) for 10x genomics data) |
--adapters | whitelist.txt | A whitelist of all valid adapter sequences used by 10x Genomics available here in txt format. |
--cores | 12 | The number of cores utilised as an integer (we recommend 12). If left blank all available cores will be used. |
--split | 1000000 | The number of sequences per core (default is 1000000). |
The group command generates an assembly by groupings contigs from a standard assembly that have been assembled from reads with similar sets of barcodes. A barcode profile is generated for each contig in the assembly from mapping of the reads to the assembly. The barcode profiles for each contig are weighted using Term-Frequency Inverse Document Frequency and used to generate a cosine similarity matrix comparing all contigs (1). Contigs with a high cosine similarity are grouped together and combined into a single contig, separated by a spacer of ambiguous nucleotides to produce the final assembly(2).
The final output is a fasta file named grouped_assemblies.fa
python3 main.py group --samfile mapping.sam --index barcode_index.csv --assembly assembly.fasta --blast contig_bait.blastn
parameter | argument | description |
---|---|---|
--samfile | mapping.sam | Raw reads mapped to the assembly with PCR duplicates removed in sam format. |
--index | barcode_index.csv | The index file generated using the index command (see above). |
--assembly | assembly.fasta | A "draft" assembly generated using de-barcoded reads in fasta format. |
--blast | contig_bait.blastn | An alignment of RenSeq baits used to generate the raw reads to the generic assembly in BLAST6 format. |
The specific details for generating each file are explained in the NLR-Assembler Pipeline section.
The contig coverage command determines the region covered by a set of contigs that have been grouped together in the final assembly. This requires a BLAST alignment of the draft assembly against a reference genome, the details of this step are described in the NLR Assembler Pipeline section and it is illustrated below. The chromosome most likely to belong to each contig group is then determined from the alignments and region covered all contigs in the group is calculated (see: step 2 below).
The final output is a csv file containing information on each contig group. In addition, the percentage of grouped contigs that cover a region of 60 kb and 1 mb are logged in the standard output.
python3 main.py contig-coverage -assembly grouped_assemblies.fa -blast contig_coverage.blastn
parameter | argument | description |
---|---|---|
--assembly | grouped_assemblies.fa | NLR-Assembler assembly in fasta format (output of the group command shown above) |
--blast | contig_coverage.blastn | An alignment of the NLR-Assembler assembly to a reference genome in BLAST6 format. |
The specific details for generating each file are explained in the NLR-Assembler Pipeline section.
The NLR coverage command individually calculates the average percentage of all NLR loci (annotated from a reference genome) covered by the draft and final assemblies. The total number of contigs in each assembly is also counted. These values are then compared to determine how NLR-Assembler restructures the draft assembly.
python3 main.py nlr-coverage --draft draft_nlr_coverage.blastn \
--final final_nlr_coverage.blastn \
--nlr nlr_sequences.fasta
The final output is a TSV file that describes: the total number of contigs in each assembly, the average percentage of NLR loci covered by each assembly and the difference between the assemblies for both metrics.
parameter | argument | description |
---|---|---|
--draft | draft_nlr_coverage.blastn | An alignment of NLR sequences annotated from a reference genome to the draft assembly in BLAST6 format. |
--final | final_nlr_coverage.blastn | An alignment of NLR sequences annotated from a reference genome to the final assembly in BLAST6 format. |
--nlr | nlr_sequences.fasta | NLR sequences annotated from a reference genome using NLR-Annotator and converted to a single-line FASTA file. |
The specific details for generating each file are explained in the NLR-Assembler Pipeline section.
Available to download from https://www.python.org/downloads/release/python-3109/
Available to download from this repository (see documentation above).
Available to download from http://bio-bwa.sourceforge.net/
Available to download from http://samtools.sourceforge.net/
Availavle to download from NCBI
Available to download from 10x Genomics
A widerange of commercial and open-source de novo assemblers can be used to produce an assembly from raw data for the pipeline. We found ABySS (v2.3.5) to be the most effective tool for our data.
Available to download from https://bioinf.shenwei.me/seqkit/download/
Availavble to download from https://github.com/steuernb/NLR-Annotator
The sequences of the RenSeq baits used for your specific experiments
A whitelist of all valid adapter sequences used by 10x Genomics for linked-read sequencing is available here.
Debarcoded and subsampled (~4% of data) as shown below:
longranger basic --id=processed --fastqs=raw_data_directory
gunzip processed/outs/barcoded.fastq.gz
sed -n '1~200p;2~200p;3~200p;4~200p;5~200p;6~200p;7~200p;8~200p' processed/outs/barcoded.fastq > processed_reads.fastq
Generate a draft assembly using the processed reads and an assembler of your choice. For our data, we had the best success with ABySS and the following parameters. Then renumber the contigs in the draft assembly using seqkit and convert the file from a multi-line fasta file to a single line fasta file.
abyss-pe name=assembly k=85 B=100G kc=3 H=3 v=-v pe='pea' pea=processed_reads.fastq lr='lra' lra=processed_reads.fastq
seqkit replace -p '.+' -r '{nr}' assembly-contigs.fa -o renumbered_assembly-contigs.fa
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < renumbered_assembly-contigs.fa > flat_renumbered_contigs.fa
tail -n +2 flat_renumbered_contigs.fa > draft_assembly.fa
Use the NLR-Assembler index command (see above) to generate an index the raw barcoded reads (Not the reads processed by Long Ranger Basic).
python3 main.py index --fastq Raw_R1_reads.fasta --adapters whitelist.txt --cores 12 --split 1000000
Align RenSeq baits from your experiment to the draft assembly using BLAST+.
blastn -query renseq_baits.fasta -subject draft_assembly.fasta -outfmt 6 -out bait_assembly_alignment.blastn
Map the (cleaned) raw reads used for de novo assembly against your draft assembly before removing PCR duplicates from the mapping. We recommend using bwa and samtools.
bwa index draft_assembly.fasta
bwa mem draft_assembly.fasta processed_reads.fastq > mapping.sam
samtools view -Sh mapping.sam > filtered_mapping.sam
samtools sort -n filtered_mapping.sam > mapping.sorted.bam
samtools fixmate -m mapping.sorted.bam fixmate.bam
samtools sort fixmate.bam > fixmate.sorted.bam
samtools markdup -r -s fixmate.sorted.bam markdup.bam
samtools view -Sh markdup.bam > markdup.sam
Use the NLR-Assembler group command (see above) to generate the final assembly using the bait alignment, read mapping, draft assembly and index generated during preprocessing.
python3 main.py group --samfile mapping.sam \
--index barcode_index.csv \
--assembly draft_assembly.fasta \
--blast bait_assembly_alignment.blastn
The final assembly can then be used in place of the standard wildtype assembly used in the MutantHunter Pipeline to identify novel resistance gene candidates.
If a reference genome is available NLR and contig coverage can be used to assess the quality of the final assembly using NLR-Assembler. Contig coverage determines the size of the region covered by a set of contigs grouped together by NLR-Assembler. NLR coverage determines the average percentage of annotated NLR sequences covered by the draft and final assemblies respectively.
Identify all NLRs sequences in the reference genome using NLR-Annotator and convert the output sequences from a multi-line to a singleline fasta file. Align the draft and final assemblies against the identified NLR sequences from the reference genome using BLAST. Use the NLR-assembler NLR coverage command (see above) to analyse the NLR coverage of the draft and final assemblies.
NLR-Annotator -i reference_genome.fa \
-x mot.txt \
-y store.txt \
-o nlr_annotations.txt \
-m nlr_annotations.motifs.bed \
-f reference_genome.fa nlr_annotations.fa 0
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < nlr_annotations.fa > multiline_nlr_annotations.fa
tail -n +2 multiline_nlr_annotations.fa > reformatted_nlr_annotations.fa
blastn -max_target_seqs 1 -query draft_assembly.fasta \
-subject reformatted_nlr_annotations.fa \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen" \
-out draft_assembly_nlr_coverage.blastn
blastn -max_target_seqs 1 -query final_assembly.fasta \
-subject reformatted_nlr_annotations.fa \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen" \
-out final_assembly_nlr_coverage.blastn
python3 main.py nlr-coverage -b draft_assembly_nlr_coverage.blastn \
-c final_assembly_nlr_coverage.blastn \
-n reformatted_nlr_annotations.fa
When analysing the output of NLR coverage In theory, there should only be a decrease in the total number of contigs not the average percentage of each NLR locus covered in the final assembly compared to the draft assembly in the final output. However, in practice NLR coverage also decreases as NLRs are occasionally grouped into gene clusters rather than individual genes. This effect is not currently accounted for by NLR coverage and causes the observed decrease in coverage.
Align the draft assembly to the reference genome using BLAST and use the NLR-Assembler query coverage command (see above) to determine the contig coverage of contigs grouped together in the final assembly.
To determine contig coverage an alignment of the draft assembly to a reference genome must first be generated using BLAST. Next use the contig coverage command with the final assembly and the previously generated alignment to determine the size of the regions covered by contigs grouped together in the final assembly.
blastn -query draft_assembly.fasta -subject reference_genome.fasta \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen" \
-out query_coverage.blastn
python3 main.py query-coverage -b query_coverage.blastn -c final_assembly.fasta
In theory, the majority of contigs should be a region smaller than the size of the genomic DNA fragments used for linked-read sequencing. E.g ~60 kb DNA fragments were sequenced, 95% of contig that were grouped together by NLR-Assembler should cover a region of 60 kb or less.
-
This software and pipeline are highly experimental use at your own risk (please contact us first if you are interested!)
-
Linked-read sequencing has been discontinued by 10x Genomics. However, alternative technologies are still available but would NLR-Assembler to be refactored for the specific dataset (But I'd be happy to collaborate on a project!).
-
Evaluation metrics for NLR-Assembler are very rudimentary and struggle to capture the complete complexity of the final assembly.
-
Parameters for generating the draft assembly with ABySS and grouping of contigs by the NLR-Assembler group command have not been optimised. Tweak these parameters as necessary for your specific dataset.
-
I can provide a copy of my dissertation which provides a more in-depth explanation of NLR-Assembler upon request here
A massive thank you to Dr. Burkhard Steuernagel and Dr. Brande Wulff who co-supervised my dissertation project and provided all the data and resources requried to complete the project!
If you wish to discuss NLR-Assembler, raise issues or collaborate please contact me here