aim/Datasets at main · safaad/aim

ERR240727-l100-e1-30000Pairs
README.md
sample-l100-e1-40K
The datasets can be downloaded from https://zenodo.org/record/6938771#.YuZef3ZByUk.

We evaluate AIM using two types of datasets: (1) real datasets for short sequence pairs and (2) synthetic datasets for long sequence pairs.

Real datasets for short sequence pairs (100bp, 150bp, and 250bp long)

The short read-reference pair sets are generated using minimap2 by mapping the following datasets to the human reference genome GRCh37:

The human reference genome can be downloaded from:

ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
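The mapping step could look roughly like the sketch below. It assumes the reference has already been downloaded and decompressed, and it uses minimap2's short-read preset (`-x sr`) with SAM output (`-a`). The exact options used to build these datasets are not stated here, so the preset, the function name `map_short_reads`, and the file names in the usage comment are all assumptions for illustration.

```shell
# Sketch only: the minimap2 options actually used for these datasets are not
# documented here; -x sr (short-read preset) is an assumed choice.

# Map short reads to a reference and write the alignments as SAM.
#   $1: reference FASTA (e.g. the decompressed human_g1k_v37.fasta)
#   $2: reads in FASTQ/FASTA (hypothetical file name)
#   $3: output SAM file
map_short_reads() {
    # -a: emit SAM output; -x sr: preset for short reads
    minimap2 -a -x sr "$1" "$2" > "$3"
}

# Usage (not run here; requires minimap2 on PATH):
#   map_short_reads human_g1k_v37.fasta reads.fastq mapped.sam
```

Wrapping the call in a function keeps the reference, reads, and output paths explicit, so the same step can be repeated for each read length.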

After extracting the read-reference pairs, we format the datasets as shown in the samples above.
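As a rough illustration of the formatting step, the sketch below converts a tab-separated file of extracted pairs into a two-line-per-pair layout. It assumes, without confirmation from this README, that the sample files follow the layout written by WFA's generator (a `>` line for the read, then a `<` line for the reference segment). The tab-separated input and the function name `format_pairs` are hypothetical.

```shell
# Sketch only: assumes the target layout is the two-line pair format written
# by WFA's generator ('>' + read, '<' + reference segment). Check this against
# the sample files before relying on it.

# Convert "read<TAB>reference-segment" lines into two-line pairs.
#   $1: tab-separated input file (hypothetical intermediate format)
format_pairs() {
    # Lines that do not have exactly two fields are skipped.
    awk -F '\t' 'NF == 2 { print ">" $1; print "<" $2 }' "$1"
}

# Usage:
#   format_pairs extracted_pairs.tsv > my-dataset
```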

Synthetic datasets for long sequence pairs (500bp, 1000bp, 5Kbp, and 10Kbp long)

We use the data generator found in WFA's repository (https://github.com/smarco/WFA) to simulate long sequence pairs:

git clone https://github.com/smarco/WFA
cd WFA
make
# Example: generate a synthetic dataset of 5M pairs with sequence length 100 and 1% edit distance
./bin/generate_dataset --n 5000000 --l 100 --e 0.01 --o ./synthetic-l100-e1-5MPairs
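To cover the four long-sequence lengths named in this section, the same command can be repeated in a loop. The sketch below reuses the flags from the example above. The pair count, the error rate, and the output naming are placeholders rather than the values used for the published datasets, and `GEN` and `generate_long_datasets` are names introduced here for illustration.

```shell
# Sketch only: the pair count and error rate are caller-supplied placeholders,
# not the values used for the published datasets.

# Path to WFA's generator; override GEN if the binary lives elsewhere.
GEN="${GEN:-./bin/generate_dataset}"

# Generate one dataset per long-sequence length (500bp, 1000bp, 5Kbp, 10Kbp).
#   $1: number of pairs
#   $2: error rate as a fraction (e.g. 0.01 for 1%)
generate_long_datasets() {
    pairs="$1"
    err="$2"
    for len in 500 1000 5000 10000; do
        # Same flags as the single-dataset example; output name is hypothetical.
        "$GEN" --n "$pairs" --l "$len" --e "$err" \
               --o "./synthetic-l${len}-e${err}-${pairs}Pairs"
    done
}

# Usage (run from the WFA checkout after `make`):
#   generate_long_datasets 1000 0.01
```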
