Download the datasets from Zenodo: https://zenodo.org/record/6938771#.YuZef3ZByUk
We evaluate AIM using two types of datasets: (1) real datasets for short sequence pairs and (2) synthetic datasets for long sequence pairs.
The short read-reference pair sets are generated using minimap2 by mapping the following datasets to the human reference genome GRCh37:
- https://www.ebi.ac.uk/ena/browser/view/ERR240727
- https://www.ebi.ac.uk/ena/browser/view/SRR826460
- https://www.ebi.ac.uk/ena/browser/view/SRR826471
The human reference genome can be downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
After extracting the read-reference pairs, we format the datasets as shown in the samples above.
We use the data generator found in WFA's repository (https://github.com/smarco/WFA) to simulate long sequence pairs:
git clone https://github.com/smarco/WFA
cd WFA
make
# Example: generate a synthetic dataset of 5M pairs with sequence length 100 and 1% edit distance
./bin/generate_dataset --n 5000000 --l 100 --e 0.01 --o ./synthetic-l100-e1-5MPairs
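
Before running a long experiment, it can help to sanity-check a generated file. The sketch below is not part of AIM or WFA. It assumes the generator writes each pair as two lines, the pattern prefixed with `>` and the text prefixed with `<`; the helper names `read_pairs`, `edit_distance`, and `summarize` are hypothetical. It reports the number of pairs, the mean pattern length, and the measured edit rate, which can be lower than the requested rate because random edits may cancel out.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_pairs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (pattern, text) pairs from a file in the assumed '>'/'<' layout."""
    pattern = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            pattern = line[1:]
        elif line.startswith("<"):
            if pattern is None:
                raise ValueError("text line without a preceding pattern line")
            yield pattern, line[1:]
            pattern = None
        else:
            raise ValueError(f"unexpected line prefix: {line[:10]!r}")


def edit_distance(a: str, b: str) -> int:
    """Plain quadratic Levenshtein distance; fine for spot checks only."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def summarize(handle: TextIO) -> Tuple[int, float, float]:
    """Return (pair count, mean pattern length, edits per pattern base)."""
    count = total_len = total_edits = 0
    for pattern, text in read_pairs(handle):
        count += 1
        total_len += len(pattern)
        total_edits += edit_distance(pattern, text)
    if count == 0 or total_len == 0:
        return count, 0.0, 0.0
    return count, total_len / count, total_edits / total_len


# Tiny in-memory demo; for a real file, pass open(path) instead.
demo = io.StringIO(">ACGT\n<ACTT\n>AAAA\n<AAAA\n")
print(summarize(demo))
```

The distance routine is quadratic in the sequence length, so for a 5M-pair file or for long sequences it is more practical to run it on the first few thousand pairs than on the whole dataset.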