A small set of utilities to work with SNP data and rsIDs. At the moment this includes 2 utilities:
rsid2pos: extract genomic coordinate given an rsID (useful for rsID based liftover)annotate: annotate variants with the corresponding rsIDmake_bin: convert dbSNP tables to binary representation for faster processing
Using binary encoding of dbSNP files and hashing the processing is super-fast and can perform these operations on a set of 30M variants in 15 minutes or 2-3 minutes if parallelized by chromosome.
You can just use the binary file of the release, it should work out of the box in most UNIX systems. If you want to compile it yourself, you need nim and nimble installed and you can use the following commands:
nimble build -d:release rsid_tools.nimbleVariants are represented internally by an hash made from {POS}{REF}_{ALT}. At the moment this string cannot be longer than 255 characters. Thus, there is a small fraction of dbSNP variants (about 100) represented by very long indels that can not be stored and thus will be missed by the utilities.
The utilities uses binary representation of variants to perform conversions. These binary files are generated using the make_bin command (see Make the binary section).
In general, it is better to run conversion by chromosomes to avoid large memory requirements to read data from all chromosomes, please see Performance considerations for more details.
Output is interpreted as a folder/prefix. When out is a directory, the resulting output file names will be {out}/{tool_prefix}-{fileprefix}.tsv, otherwise {out}-{tool_prefix}-{fileprefix}.tsv. Here tool_prefix contains name of the tool and the target genome build, and fileprefix is the prefix of the input file (removing extension and .gz if present). NB. Any folder present in the output path must exist.
For example, for the input file input.tsv.gz
- with
--out outputpointing to an existing folder, the output file will beoutput/rsid2pos_GRCh37-input.tsv - with
--out output/test, the output file will beoutput/test-rsid2pos_GRCh37-input.tsv
When --out is omitted the output is printed to stdout
This command will extract the genomic coordinates of a set of rsIDs. The usage is as follows:
Usage:
liftover [options] [intables ...]
Arguments:
[intables ...] input table/list containing rsID to liftover
Options:
-h, --help
-o, --out=OUT Output folder or prefix. Output structure is: rsid2pos_{target_build}-{infileprefix}.tsv. If not provided output to stdout
-e, --header Input file has header
-s, --sep=SEP Delimiter for input file (default: tab)
-i, --rsid_column=RSID_COLUMN
0-based index of column containing rsIDs in the input (default: 1)
-x, --chrom_column=CHROM_COLUMN
0-based index of column containing chromosome in the input. Use -1 if not present (default: 0)
-c, --chrom=CHROM Only process data for specific chromosomes. Comma-sep list accepted or -1 for all chromosomes (default: -1)
-t, --target=TARGET Target genome build for liftover Possible values: [GRCh37, GRCh38]
-n, --no_missing Do not include in output rsIDs that can't be liftovered to the target build
-d, --map_dir=MAP_DIR Directory containing binaries files generated by make_binThis command will annotate a set of variants defined by genomic coordinates, ref and alt alleles with the corresponding rsID.
The input variants can be provided in a tab-separated file or as a file with a list of variant IDs. Variant IDs can be any string that can be parsed by regular expression. By default, this expects: chrom:pos:ref:alt but you can change this using the parse_exp argument. Keep in mind that the order of matched elements must be: chromosome, position, reference allele, alternative allele.
The usage is as follows:
Usage:
annotate [options] [intables ...]
Arguments:
[intables ...] Tables containing variants to annotate. Can be a list of variant IDs or a table with columns containing variant ID, chromosome, position, ref and alt alleles
Options:
-h, --help
-s, --sep=SEP Column separator (default: tab)
-e, --header Input file has header
-b, --build=BUILD Genome build Possible values: [GRCh37, GRCh38]
-o, --out=OUT Output folder or prefix. Output structure is: rsid2pos_{target_build}-{infileprefix}.tsv. If not provided output to stdout
-i, --varid_column=VARID_COLUMN
0-based index of column containing variantID. If set all others columns will be ignored and information are parsed from id values (default: -1)
-e, --parse_exp=PARSE_EXP Pattern for parsing chrom, pos, ref, alt from variant ID (default: ([0-9XYM]+):([0-9]+):([ACTG]+):([ACTG]+))
-c, --chrom=CHROM Only process data for specific chromosomes. Comma-sep list accepted or -1 for all chromosomes (default: -1)
-x, --chrom_column=CHROM_COLUMN
0-based index of column containing chromosome in the input (default: -1)
-p, --pos_column=POS_COLUMN
0-based index of column containing position in the input (default: -1)
-r, --ref_column=REF_COLUMN
0-based index of column containing REF allele in the input (default: -1)
-a, --alt_column=ALT_COLUMN
0-based index of column containing ALT allele in the input (default: -1)
-d, --map_dir=MAP_DIR Directory containing binaries files generated by make_bin
-n, --no_missing Do not include in output rsIDs that can't be liftovered to the target buildThen chromosome information is present in the input, the tools only load information for the needed chromosome(s). This makes the process faster and more efficient. Chromosome is mandatory for annotate, but it is recommended to provide chromosome information also when running rsid2pos. Othwerwise, the tool will load all the binary files, this takes time and memory (up to 64GB for the GRCh38 dbSNP151).
The --chrom argument makes easy to parallelize the process by chromosome. You can use the following command to parallelize the process by chromosome. This accepts a comma-separated list and only variants located on the chromosome(s) provided will be processed. This is convenient to automatically split execution across chromosomes and then merge the results.
For example:
CHROM="1"
BUILD="GRCh38"
rsid_tools rsid2pos -t ${BUILD} -o /results/${CHROM} -c $CHROM -d /path/to/binaries/ /path/to/input.tsvFor rsid2pos this approach makes sense only when the input file also contains chromosome information. Otherwise, the tool always need to load binary data for all chromosome to be sure your variant can be found so the speed up is minimal.
To be able to perform conversions, you first need to prepare binary representation of you dnbSNP tables. This can be done using the make_bin command. This will convert any input table containing SNP information into 2 binary files, used for genomic coordinate annotation or rsID annotation. These files are named {BUILD}_{CHROM}.rsid2pos.bin and {BUILD}_{CHROM}.hash2rsid.bin and must be all stored in the same folder. Ideally, you want to have one such folder for each dbSNP release.
These dbSNP tables must be tab-separated, eventually gzipped, and containing the following information: chromosome, rsID, position, reference allele, alternative allele. You can set the order of columns using the corresponding argument. The header, if present, must have # as the first character.
Note that each input file must contain SNPs from a single chromosome and only one file per chromosome is allowed at the moment.
The resulting
You can generate suitable tables from the dbSNP VCF files using the following command:
BUILD="GRCh37"
CHROM="1"
DBSNP_VERSION="151"
echo -e "#CHROM\tRSID\tPOS\tREF\tALT" > ${BUILD}_dbSNP${DBSNP_VERSION}.chr${CHROM}.tsv
bcftools query -r $CHROM -f "%CHROM\t%ID\t%POS\t%REF\t%ALT\n" ${BUILD}-All.vcf.gz | sed 's/rs//g' | sort -k1,1V -k2,2n >> ${BUILD}_dbSNP${DBSNP_VERSION}.chr${CHROM}.tsv
bgzip ${BUILD}_dbSNP${DBSNP_VERSION}.chr${CHROM}.tsvThe usage for the make_bin command is as follows:
Usage:
make_bin [options] [intables ...]
Arguments:
[intables ...] dbSNP tables to convert to binary
Options:
-h, --help
-b, --build=BUILD Genome build Possible values: [GRCh37, GRCh38]
-v, --dbsnp_version=DBSNP_VERSION
dbSNP version
-o, --out=OUT Output folder (default: ./)
-i, --rsid_column=RSID_COLUMN
0-based index of column containing rsIDs in the input (default: 1)
-x, --chrom_column=CHROM_COLUMN
0-based index of column containing chromosome in the input (default: 0)
-p, --pos_column=POS_COLUMN
0-based index of column containing position in the input (default: 2)
-r, --ref_column=REF_COLUMN
0-based index of column containing REF allele in the input (default: 3)
-a, --alt_column=ALT_COLUMN
0-based index of column containing ALT allele(s) in the input. If multiple ALTs they must be comma-separated (default: 4)