Motivation
We develop and provide tools to analyse CRISPR arrays in microbial genome sequences. Our tools detect CRISPR arrays (CRISPRDetect) and then identify the targets of CRISPR-array spacers in viral, plasmid and other sequences (CRISPRTarget). The increasing popularity of CRISPRTarget has prompted us to provide a standalone version for download, as well as a version on the Galaxy platform. Matches between genomic spacers and plasmid or viral genomes can be used to predict the microbial host of nucleic acid sequences such as mobile genetic elements and antibiotic resistance genes. Our tool CRISPRHost adapts this approach to predict hosts.
Dependencies
All dependencies are provided in the bin directory:
- bedtools
- blastdbcmd
- blastn
- esl-shuffle
- makeblastdb
- samtools
Run the "chmod" command to make them executable:
chmod -R 777 bin
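Before running CRISPRTarget, it can help to confirm that every bundled tool is present and executable. The following is a minimal sketch: the function name `check_deps` is hypothetical and not part of the distribution, and it assumes only that the tools listed above sit directly inside the directory passed to it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with CRISPRTarget): report bundled tools
# that are missing or not executable in the given directory (default: bin).
check_deps() {
    local dir="${1:-bin}"
    local missing=0
    local tool
    for tool in bedtools blastdbcmd blastn esl-shuffle makeblastdb samtools; do
        # A tool counts as ready only if it is a regular file with the execute bit set.
        if [ ! -f "$dir/$tool" ] || [ ! -x "$dir/$tool" ]; then
            echo "not executable or missing: $dir/$tool"
            missing=1
        fi
    done
    # Exit status 0 when all tools are ready, 1 otherwise.
    return "$missing"
}
```

For example, `check_deps bin && echo "all dependencies ready"` prints the confirmation only when all six tools pass the check.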
Use case 1
A user has generated a CRISPRDetect-formatted GFF file of CRISPR arrays using CRISPRDetect, and has a set of genomic sequences to be used as the database. The command below builds a BLASTDB and an index file for the genomic sequences, then dinucleotide-shuffles the genomic sequences and builds a second BLASTDB and index file for the shuffled sequences. It then searches for CRISPR-spacer targets and computes p-values using the shuffled BLASTDB. The genomic and shuffled BLASTDBs and index files can be saved for later use.
perl CRISPRTarget.pl -gff sample_crispr_gff/PSA.crispr.gff -user_fasta sample_db/vhdb_selected.fna -dbsize 100000000 -evalue 1 -out test_out -pam_search_all
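The database preparation described for this use case can be pictured as the sequence of calls below. This is only a sketch and not necessarily what CRISPRTarget.pl runs internally: the function names `run` and `prepare_dbs`, the `DRY_RUN` variable, the `.shuffled.fna` naming, and the use of `samtools faidx` for the index file are all assumptions made for illustration. The tool options themselves (`makeblastdb -in/-dbtype`, `esl-shuffle -d/-o`) are standard options of those programs.

```shell
#!/usr/bin/env bash
# Sketch only: NOT necessarily the exact commands CRISPRTarget.pl runs.
# run, prepare_dbs, DRY_RUN and the output naming are hypothetical.
run() {
    # Print the command when DRY_RUN is set; otherwise execute it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_dbs() {
    local fasta="$1"                           # genomic sequences (FASTA)
    local shuffled="${fasta%.*}.shuffled.fna"  # assumed name for the shuffled copy

    # 1. BLAST database and index for the genomic sequences
    run bin/makeblastdb -in "$fasta" -dbtype nucl
    run bin/samtools faidx "$fasta"

    # 2. Dinucleotide-preserving shuffle (-d), written to a new FASTA (-o)
    run bin/esl-shuffle -d -o "$shuffled" "$fasta"

    # 3. BLAST database and index for the shuffled sequences
    run bin/makeblastdb -in "$shuffled" -dbtype nucl
    run bin/samtools faidx "$shuffled"
}
```

Calling `DRY_RUN=1 prepare_dbs sample_db/vhdb_selected.fna` lists the five commands without executing them, which is a convenient way to inspect the steps before committing to a long run.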
Use case 2
A user has generated a CRISPRDetect-formatted GFF file of CRISPR arrays and already has the BLASTDB and index file computed by a previous call, as well as the shuffled versions of these. The user can search for CRISPR-spacer targets and compute p-values directly with the command below. In this case, the BLASTDBs and index files for both the genomic and the shuffled sequences are required.
perl CRISPRTarget.pl -gff sample_crispr_gff/PSA.crispr.gff -db USER_DB/vhdb_selected.fna -ctrl_db USER_SHUFFLED_DB/vhdb_selected.fna -dbsize 100000000 -evalue 1 -out test_out_2 -pam_search_all
The shuffled BLASTDB and index file need not be derived from the same BLASTDB that is being queried. Because the shuffled BLASTDB and index file serve as the background for p-value calculation, they must be large and should cover a diverse range of organisms and nucleotide compositions.
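This README does not state how the p-value is derived from the shuffled background. As an illustration of why the size of the background matters, the sketch below computes a generic empirical p-value: the fraction of background scores at least as high as the observed score, using the add-one estimate (k + 1) / (n + 1). The function name `empirical_pvalue` and the one-score-per-line input file are assumptions, and the method used by CRISPRTarget may differ.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: empirical p-value of one observed score against
# a file of background scores (one number per line), e.g. scores of hits
# against the shuffled database. Not necessarily CRISPRTarget's own method.
empirical_pvalue() {
    local observed="$1" background="$2"
    awk -v obs="$observed" '
        # Count non-empty lines (n) and those scoring at least obs (k).
        NF { n++; if ($1 + 0 >= obs + 0) k++ }
        # Add-one estimate, so the result is never exactly zero.
        END { printf "%.4f\n", (k + 1) / (n + 1) }
    ' "$background"
}
```

With n background scores, the smallest value this estimate can return is 1 / (n + 1), so a small background cannot support a small p-value regardless of how strong the match is.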
The application will report targets found in the BLASTDB in the following ways:
- A tabular output equivalent to the one generated by the web version, with additional columns, including a column for the p-value of the score.
- An HTML output equivalent to the one generated by the web version, with the p-value displayed. For clarity and transparency, the score is not decremented or incremented by repeat-region matches and PAM matches. This information is still displayed, and users can decide whether to take it into account.
- A text file equivalent to the HTML output.
For technical reasons, the JavaScript user interface seen in the header of the web version of the HTML output is not provided.
In a future release, we will provide a mechanism to interface the command-line version of CRISPRTarget with CRISPRCasTyper, in order to type CRISPR arrays by repeat sequence using the utility "RepeatType".
This will improve PAM searching, because different types of CRISPR arrays have their own sets of PAMs. By default, we search for PAMs under the constraint that the super-types of the CRISPR arrays must match; if type information is unavailable, PAMs are not searched.
In the current release, to enable PAM searching without type information, use the command-line argument "-pam_search_all", which searches for PAMs independently of CRISPR array type.