- CRISPR-Cas is a bacterial immune system also famous for its use in genome editing. The diversity of known systems could be significantly increased by metagenomic data.
- Here we present the Metagenomic CRISPR Array Analysis Tool MCAAT, a highly sensitive algorithm for finding CRISPR Arrays in un-assembled metagenomic data.
- It takes advantage of the properties of CRISPR arrays that form multicycles in de Bruijn graphs.
- MCAAT's assembly-free graph-based strategy outperforms assembly-based workflows and other assembly-free methods on synthetic and real metagenomes.
- Docker container available under: https://hub.docker.com/r/feeka94/mcaat
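The multicycle property mentioned above can be illustrated on a toy example. The sketch below is not MCAAT's algorithm; it only builds a small node-centric de Bruijn graph from a made-up repeat-spacer-repeat-spacer-repeat sequence and counts the simple cycles that pass through a repeat k-mer. The function names (`build_de_bruijn`, `count_cycles_through`) and the toy sequence are assumptions made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

using Graph = std::map<std::string, std::set<std::string>>;

// Build a node-centric de Bruijn graph: every k-mer is a node, and an edge
// links each k-mer to the k-mer that follows it in the sequence.
Graph build_de_bruijn(const std::string& seq, std::size_t k) {
    Graph g;
    if (k == 0 || seq.size() < k) return g;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        const std::string node = seq.substr(i, k);
        g[node];  // create the node even if it has no successor
        if (i + k < seq.size()) g[node].insert(seq.substr(i + 1, k));
    }
    return g;
}

// Depth-first search over simple paths; counts the paths that return to `start`.
static std::size_t dfs_cycles(const Graph& g, const std::string& start,
                              const std::string& cur,
                              std::set<std::string>& on_path,
                              std::size_t max_len) {
    std::size_t cycles = 0;
    const auto it = g.find(cur);
    if (it == g.end()) return 0;
    for (const std::string& next : it->second) {
        if (next == start) { ++cycles; continue; }  // closed a cycle
        if (on_path.size() >= max_len || on_path.count(next)) continue;
        on_path.insert(next);
        cycles += dfs_cycles(g, start, next, on_path, max_len);
        on_path.erase(next);
    }
    return cycles;
}

// Number of simple cycles through `start` that contain at most `max_len` nodes.
std::size_t count_cycles_through(const Graph& g, const std::string& start,
                                 std::size_t max_len) {
    std::set<std::string> on_path{start};
    return dfs_cycles(g, start, start, on_path, max_len);
}
```

With the toy repeat `GATTACA`, the spacers `CCGGT` and `TGCGC`, and k = 4, the repeat k-mer `TACA` branches once per spacer and both branches lead back into the repeat, so two cycles share the repeat nodes. Real metagenomic graphs are far larger and noisier than this example.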
- Version 0.3 makes use of the following optimization techniques:
- Better data structures for preprocessing (phmap::flat_hash_set)
- Added compiler intrinsics to guide the hardware in the right direction
- Reserving the capacity to prevent rehashing

In-depth technical details: educational resource and optimization developer notes.

As a result of the above optimizations, we achieved a 17-25x speedup on a graph with 1 billion nodes (from 3 days to 3 hours). Considering the complexity of the graphs, this is a substantial improvement.
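The effect of reserving capacity can be sketched with the standard library. The example below uses `std::unordered_set` as a stand-in so that it compiles without extra dependencies; it is not MCAAT's code, and the helper name `count_rehashes` is hypothetical. It counts how often the bucket array changes size while distinct keys are inserted, with and without an up-front `reserve`.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Inserts n distinct keys and returns how many times the bucket count changed,
// i.e. how many rehashes happened during insertion.
std::size_t count_rehashes(std::size_t n, bool reserve_first) {
    std::unordered_set<std::uint64_t> seen;
    if (reserve_first) seen.reserve(n);  // size the table once, up front

    std::size_t rehashes = 0;
    std::size_t buckets = seen.bucket_count();
    for (std::uint64_t key = 0; key < n; ++key) {
        // Multiplying by an odd constant keeps the keys distinct modulo 2^64.
        seen.insert(key * 2654435761ULL);
        if (seen.bucket_count() != buckets) {
            ++rehashes;
            buckets = seen.bucket_count();
        }
    }
    return rehashes;
}
```

When the final size is known or can be estimated, a single `reserve` call avoids the repeated reallocation and reinsertion that a growing table would otherwise perform. The same idea should carry over to `phmap::flat_hash_set`, which is assumed here to offer a comparable `reserve` member.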
docker build -t mcaat .

Mount your working directory to access input/output files:

docker run --rm -v $(pwd):/data mcaat \
--input_files /data/reads_R1.fastq /data/reads_R2.fastq \
--output-folder /data/results

The final image is based on debian:bookworm-slim and includes only:
- The mcaat binary
- Runtime libraries: libomp5, zlib1g
This keeps the image small and portable.
To remove the image:
docker rmi mcaat

To make ./install.sh executable, run the following command:

chmod +x ./install.sh

You can then build the project; the working version is saved in the build folder.

./install.sh

It is also possible to install the library by adding the --install flag.

./install.sh --install

To clean up, use the --clean flag instead.

./mcaat --input_files <file1> [file2] [--ram <amount>] [--threads <num>] [--output-folder <path>] [--help]

| Argument | Description |
|---|---|
| --input_files <file1> [file2] | One or two input FASTA/FASTQ files. If one file is provided, it is treated as single-end data. If two files are provided, they are treated as paired-end reads. |

| Argument | Description |
|---|---|
| --ram <amount> | Maximum RAM to use. Units: B, K, M, G. Default: 95% of system RAM. Example: --ram 4G |
| --threads <num> | Number of threads to use. Default: total CPU cores minus 2 |
| --output-folder <path> | Output directory for results. If not provided, a timestamped folder will be created automatically. If provided, the folder is used exactly as given. |
| --help, -h | Show usage information and exit |
The tool creates the following directory structure inside the specified output folder:
<output-folder>/
├── CRISPR_Arrays.txt # Raw CRISPR array output
| Scenario | Command |
|---|---|
| Paired-end input with custom output | ./mcaat --input_files reads_R1.fastq reads_R2.fastq --ram 8G --threads 12 --output-folder results/my_run |
| Single-end input with default output | ./mcaat --input_files reads.fastq (creates a folder like mcaat_run_2025-07-07_15-30-00/) |
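The documented defaults (total CPU cores minus 2, and a timestamped output folder such as mcaat_run_2025-07-07_15-30-00/) could be computed roughly as follows. This is a sketch and not the tool's actual code: the function names are hypothetical, and the floor of one thread on small machines is an assumption.

```cpp
#include <cstddef>
#include <ctime>
#include <string>
#include <thread>

// Default thread count: total cores minus 2.
// Assumption: never go below 1, also when the core count is unknown (0).
unsigned default_threads(unsigned hardware_threads = std::thread::hardware_concurrency()) {
    return hardware_threads > 2 ? hardware_threads - 2 : 1;
}

// Formats a folder name like mcaat_run_2025-07-07_15-30-00 from a broken-down time.
std::string run_folder_name(const std::tm& when) {
    char buf[64];
    const std::size_t n =
        std::strftime(buf, sizeof buf, "mcaat_run_%Y-%m-%d_%H-%M-%S", &when);
    return std::string(buf, n);
}
```

In a real program the `std::tm` value would typically come from `std::localtime` applied to the current time; passing it in as a parameter keeps the formatting step easy to test.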
- Input files must exist and be accessible.
- If RAM is set below 1 GB or above system capacity, the program will exit with an error.
- If only one input file is provided, the tool assumes single-end data.
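The RAM rules above (a value with a unit of B, K, M, or G, rejected when below 1 GB or above system capacity) can be sketched as a small parser plus a range check. This is an illustration under stated assumptions, not MCAAT's implementation: the names `parse_ram` and `ram_in_range` are hypothetical, units are taken to be binary multiples (1K = 1024 bytes), and a unit suffix is treated as mandatory.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>

// Parses "<digits><unit>" with unit B, K, M, or G into a byte count.
// Assumptions: binary multiples and a mandatory unit suffix.
std::optional<std::uint64_t> parse_ram(const std::string& text) {
    constexpr std::uint64_t kMax = std::numeric_limits<std::uint64_t>::max();
    if (text.size() < 2) return std::nullopt;  // need a digit and a unit

    std::uint64_t value = 0;
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        const char c = text[i];
        if (c < '0' || c > '9') return std::nullopt;
        const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
        if (value > (kMax - digit) / 10) return std::nullopt;  // overflow guard
        value = value * 10 + digit;
    }

    std::uint64_t multiplier = 0;
    switch (text.back()) {
        case 'B': multiplier = 1; break;
        case 'K': multiplier = 1ULL << 10; break;
        case 'M': multiplier = 1ULL << 20; break;
        case 'G': multiplier = 1ULL << 30; break;
        default: return std::nullopt;  // unknown unit
    }
    if (value > kMax / multiplier) return std::nullopt;  // overflow guard
    return value * multiplier;
}

// True when the request is at least 1 GB and no larger than system memory.
bool ram_in_range(std::uint64_t requested_bytes, std::uint64_t system_bytes) {
    constexpr std::uint64_t kOneGiB = 1ULL << 30;
    return requested_bytes >= kOneGiB && requested_bytes <= system_bytes;
}
```

A caller would parse the `--ram` argument first and then report an error and exit if either step fails, which mirrors the behavior described in the list above.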
Create a simple key=value text file (one setting per line) and pass it with --settings /path/to/file.
The program reads values from this file unless you override them with CLI flags. If you change the file, rerun the program to apply the new values.
Example of settings.txt (must include input_files):
# MUST INCLUDE
input_files=/data/sample_folder/1.fastq /data/sample_folder/2.fastq
ram=128G
threads=26
output_folder=results/run_2025-11-19
# OPTIONAL
cycle_max_length=77
cycle_min_length=27
threshold_multiplicity=20
low_abundance=true
Notes:
- input_files accepts one or two paths; entries may be separated by spaces, commas, or semicolons.
- Values given on the command line override those in settings.txt. For example, you can keep most settings in the settings.txt file and change only the -i parameter.
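The settings format described above (one key=value pair per line, `#` comment lines as in the example, several accepted separators for input_files, and command-line values taking precedence) can be sketched as follows. This is a minimal illustration rather than the tool's actual parser; the names `parse_settings`, `split_paths`, and `merge_settings` are hypothetical, and skipping malformed lines silently is an assumption.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Settings = std::map<std::string, std::string>;

// Removes leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Reads key=value lines; blank lines and lines starting with '#' are skipped.
Settings parse_settings(std::istream& in) {
    Settings out;
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // assumption: ignore malformed lines
        out[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return out;
}

// Splits an input_files value on spaces, commas, or semicolons.
std::vector<std::string> split_paths(const std::string& value) {
    std::vector<std::string> paths;
    std::string current;
    for (const char c : value) {
        if (c == ' ' || c == ',' || c == ';') {
            if (!current.empty()) { paths.push_back(current); current.clear(); }
        } else {
            current += c;
        }
    }
    if (!current.empty()) paths.push_back(current);
    return paths;
}

// Command-line values win over values read from the settings file.
Settings merge_settings(Settings from_file, const Settings& from_cli) {
    for (const auto& kv : from_cli) from_file[kv.first] = kv.second;
    return from_file;
}
```

Any `std::istream` works as input, for example a `std::ifstream` opened on the file passed via --settings. Merging the file values first and the command-line values second gives the override order described in the notes.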
- C++17 compiler
- RapidFuzz (for fuzzy string matching)
- Filesystem support (<filesystem>)
If you encounter issues or have questions, feel free to open an issue or email us at fikrat.talibli@ibmg.uni-stuttgart.de. If you use this software, please cite this paper: https://academic.oup.com/microlife/article/doi/10.1093/femsml/uqaf016/8205558.