Pipeline B

This workflow is used for detecting potential horizontal transfer events, starting from a set of curated repeat consensus sequences from available sources (e.g. RepBase or Dfam) for the TE of interest.

It is a simplified version of the code used to infer horizontal transfer events involving L1 and BovB retrotransposons in eukaryotes, as described here: https://github.com/AdelaideBioinfo/horizontalTransfer

Requirements:

BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
LASTZ (https://www.bx.psu.edu/~rsharris/lastz/)
SAMtools (http://www.htslib.org/)
BEDtools (http://bedtools.readthedocs.io/en/latest/)
CENSOR, which requires wu-blast and bioperl (https://girinst.org/downloads/software/censor/)
VSEARCH (https://github.com/torognes/vsearch)
USEARCH (https://www.drive5.com/usearch/)

Prerequisites

Some level of familiarity with computers and queuing systems (e.g. SLURM).

Test dataset

A test genome (fungus Yarrowia lipolytica) has been placed in test_genome, along with a set of L1 repeats as a test_query. We recommend trying out the workflow using these files first.

Usage

0) Download and prepare the genomes you want to screen.

0a) Append species names to the genome name.

As the names given to genome assemblies are not usually informative, you will want to append species names to the genome names.

Run 0a_rename_genome.sh. Example usage:

GENOME=<source_genome> SPECIES=<species_name> bash 0a_rename_genome.sh

0b) Make each genome a BLAST database and create indexes.

Run 0b_make_database_and_index.sh. Example usage:

bash 0b_make_database_and_index.sh

1) BLAST TE of interest against all available genomes.

1a) Use TBLASTN with protein sequence queries.

This will identify similar TEs in distantly related species. Output will be nucleotide sequences. Run 1a_tblastn_and_extract.sbatch. Example usage:

DIR=test_genome DATABASE=YarrowiaLipolytica_ASM252v1.fa QUERY=L1_ORFp.fasta RESULTSDIR=results sbatch 1a_tblastn_and_extract.sbatch

1b) (Optional) Use BLASTN or LASTZ with nucleotide sequence queries.

Run 1b_lastz_and_extract.sbatch. Example usage:

GENOMEDIR=test_genome GENOME=YarrowiaLipolytica_ASM252v1.fa QUERYDIR=test_query QUERY=L1_nucl_seqs.fasta RESULTSDIR=results sbatch 1b_lastz_and_extract.sbatch

1c) For each genome, combine all identified nucleotide sequences from the previous steps.

Run 1c_combine_hits.sbatch. Example usage:

SPECIES=YarrowiaLipolytica ELEMENT=L1 LASTZFILE=YarrowiaLipolytica_ASM252v1.fa_L1_nucl_seqs.fasta_lastz.bed TBLASTNFILE=YarrowiaLipolytica_ASM252v1.fa_L1_ORFp.fasta_merged.bed GENOME=YarrowiaLipolytica_ASM252v1.fa RESULTSDIR=results sbatch 1c_combine_hits.sbatch

1d) Add header annotations to indicate the genome that each sequence was derived from.

Run 1d_append_name_to_headers.sh. Example usage:

SPECIES=YarrowiaLipolytica ELEMENT=L1 RESULTSDIR=results bash 1d_append_name_to_headers.sh

Repeat screening in an iterative process (e.g. BLAST-ing the new, larger, query dataset against each genome and then combining the output) until no new hits are found.

2) Perform a reciprocal best hit check.

2a) Use CENSOR to compare hits against known repeat databases (e.g. RepBase or Dfam).

Run 2a_censor_sequences.sbatch. Example usage:

INDIR=results FILE=YarrowiaLipolytica_L1_combined.fasta OUTDIR=results/censored sbatch 2a_censor_sequences.sbatch

2b) Confirm and extract hits that match the correct TE family.

Run 2b_check_censor_output.sbatch. Example usage:

SPECIES=Yarrowia.lipolytica FILE=Yarrowia.lipolytica_L1_combined.fasta GENOME=test_genome/YarrowiaLipolytica_ASM252v1.fa ELEMENT=L1 QUERY=test_query/known_L1_elements_from_repbase.txt CENSORDIR=results/censored sbatch 2b_check_censor_output.sbatch

3) Cluster all sequences obtained from the iterative alignment screening.

Prior to this step, you will need to combine hits from all genomes into one file. Make sure that sequence headers indicate the species that each TE sequence was derived from.

This clustering step is important as it will reveal likely HTT events which are manifested as clusters of highly similar elements that include elements from multiple species. We have found it best to use sequence divergence cut offs that cluster most closely related sequences (e.g. <20% divergent).

3a) All-against-all clustering of nucleotide sequences using VSEARCH.

You can use full-length nucleotide sequences, or nucleotide sequences of the open reading frames only.

Run 3a_vsearch_cluster_for_nucleotide_seqs.sbatch, changing the clustering identity threshold (ID) as required. Example usage:

INDIR=results/allSpeciesCombined FILE=allSpecies_L1.fasta ID=80 PREFIX=c sbatch 3a_vsearch_cluster_for_nucleotide_seqs.sbatch

3b) All-against-all clustering of amino acid sequences using USEARCH.

VSEARCH (the open source alternative to USEARCH) does not support protein sequences, but will not fail if given protein sequence input. Make sure you use another program (e.g. Cd-hit or USEARCH) to clustering amino acid sequences. The 32-bit version of USEARCH is open source.

The following script can be used to perform all-against-all cluserting of amino acid sequences from ORFs, or reverse transcriptase domains from retrotransposons/transposase domains from DNA transposons.

Run 3b_usearch_cluster_for_aa_seqs.sbatch. Example usage:

INDIR=results/allSpeciesCombined FILE=allSpecies_L1_ORFp.fasta ID=90 PREFIX=r sbatch 3b_usearch_cluster_for_aa_seqs.sbatch

Finally, identify clusters containing TE sequences from multiple species (e.g. based on the sequence header names). These clusters are the HTT candidates.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pipeline B

Requirements:

Prerequisites

Test dataset

Usage

0) Download and prepare the genomes you want to screen.

0a) Append species names to the genome name.

0b) Make each genome a BLAST database and create indexes.

1) BLAST TE of interest against all available genomes.

1a) Use TBLASTN with protein sequence queries.

1b) (Optional) Use BLASTN or LASTZ with nucleotide sequence queries.

1c) For each genome, combine all identified nucleotide sequences from the previous steps.

1d) Add header annotations to indicate the genome that each sequence was derived from.

2) Perform a reciprocal best hit check.

2a) Use CENSOR to compare hits against known repeat databases (e.g. RepBase or Dfam).

2b) Confirm and extract hits that match the correct TE family.

3) Cluster all sequences obtained from the iterative alignment screening.

3a) All-against-all clustering of nucleotide sequences using VSEARCH.

3b) All-against-all clustering of amino acid sequences using USEARCH.

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 127 Commits
test_genome		test_genome
test_query		test_query
0a_rename_genome.sh		0a_rename_genome.sh
0b_make_database_and_index.sh		0b_make_database_and_index.sh
1a_tblastn_and_extract.sbatch		1a_tblastn_and_extract.sbatch
1b_lastz_and_extract.sbatch		1b_lastz_and_extract.sbatch
1c_combine_hits.sbatch		1c_combine_hits.sbatch
1d_append_name_to_headers.sh		1d_append_name_to_headers.sh
2a_censor_sequences.sbatch		2a_censor_sequences.sbatch
2b_check_censor_output.sbatch		2b_check_censor_output.sbatch
3a_vsearch_cluster_for_nucleotide_seqs.sbatch		3a_vsearch_cluster_for_nucleotide_seqs.sbatch
3b_usearch_cluster_for_aa_seqs.sbatch		3b_usearch_cluster_for_aa_seqs.sbatch
README.md		README.md

atmaivancevic/Pipeline_B

Folders and files

Latest commit

History

Repository files navigation

Pipeline B

Requirements:

Prerequisites

Test dataset

Usage

0) Download and prepare the genomes you want to screen.

0a) Append species names to the genome name.

0b) Make each genome a BLAST database and create indexes.

1) BLAST TE of interest against all available genomes.

1a) Use TBLASTN with protein sequence queries.

1b) (Optional) Use BLASTN or LASTZ with nucleotide sequence queries.

1c) For each genome, combine all identified nucleotide sequences from the previous steps.

1d) Add header annotations to indicate the genome that each sequence was derived from.

2) Perform a reciprocal best hit check.

2a) Use CENSOR to compare hits against known repeat databases (e.g. RepBase or Dfam).

2b) Confirm and extract hits that match the correct TE family.

3) Cluster all sequences obtained from the iterative alignment screening.

3a) All-against-all clustering of nucleotide sequences using VSEARCH.

3b) All-against-all clustering of amino acid sequences using USEARCH.

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages