Skip to content

HTGenomeAnalysisUnit/rsid_tools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rsID tools

A small set of utilities to work with SNP data and rsIDs. At the moment this includes 2 utilities:

  • rsid2pos: extract genomic coordinate given an rsID (useful for rsID based liftover)
  • annotate: annotate variants with the corresponding rsID
  • make_bin: convert dbSNP tables to binary representation for faster processing

Using binary encoding of dbSNP files and hashing the processing is super-fast and can perform these operations on a set of 30M variants in 15 minutes or 2-3 minutes if parallelized by chromosome.

Installation

You can just use the binary file of the release, it should work out of the box in most UNIX systems. If you want to compile it yourself, you need nim and nimble installed and you can use the following commands:

nimble build -d:release rsid_tools.nimble

Known limitations

Variants are represented internally by an hash made from {POS}{REF}_{ALT}. At the moment this string cannot be longer than 255 characters. Thus, there is a small fraction of dbSNP variants (about 100) represented by very long indels that can not be stored and thus will be missed by the utilities.

Usage

The utilities uses binary representation of variants to perform conversions. These binary files are generated using the make_bin command (see Make the binary section).

In general, it is better to run conversion by chromosomes to avoid large memory requirements to read data from all chromosomes, please see Performance considerations for more details.

Output is interpreted as a folder/prefix. When out is a directory, the resulting output file names will be {out}/{tool_prefix}-{fileprefix}.tsv, otherwise {out}-{tool_prefix}-{fileprefix}.tsv. Here tool_prefix contains name of the tool and the target genome build, and fileprefix is the prefix of the input file (removing extension and .gz if present). NB. Any folder present in the output path must exist.

For example, for the input file input.tsv.gz

  • with --out output pointing to an existing folder, the output file will be output/rsid2pos_GRCh37-input.tsv
  • with --out output/test, the output file will be output/test-rsid2pos_GRCh37-input.tsv

When --out is omitted the output is printed to stdout

rsid2pos

This command will extract the genomic coordinates of a set of rsIDs. The usage is as follows:

Usage:
  liftover [options] [intables ...]

Arguments:
  [intables ...]   input table/list containing rsID to liftover

Options:
  -h, --help
  -o, --out=OUT              Output folder or prefix. Output structure is: rsid2pos_{target_build}-{infileprefix}.tsv. If not provided output to stdout
  -e, --header               Input file has header
  -s, --sep=SEP              Delimiter for input file (default: tab)
  -i, --rsid_column=RSID_COLUMN
                             0-based index of column containing rsIDs in the input (default: 1)
  -x, --chrom_column=CHROM_COLUMN
                             0-based index of column containing chromosome in the input. Use -1 if not present (default: 0)
  -c, --chrom=CHROM          Only process data for specific chromosomes. Comma-sep list accepted or -1 for all chromosomes (default: -1)
  -t, --target=TARGET        Target genome build for liftover Possible values: [GRCh37, GRCh38]
  -n, --no_missing           Do not include in output rsIDs that can't be liftovered to the target build
  -d, --map_dir=MAP_DIR      Directory containing binaries files generated by make_bin

annotate

This command will annotate a set of variants defined by genomic coordinates, ref and alt alleles with the corresponding rsID.

The input variants can be provided in a tab-separated file or as a file with a list of variant IDs. Variant IDs can be any string that can be parsed by regular expression. By default, this expects: chrom:pos:ref:alt but you can change this using the parse_exp argument. Keep in mind that the order of matched elements must be: chromosome, position, reference allele, alternative allele.

The usage is as follows:

Usage:
  annotate [options] [intables ...]

Arguments:
  [intables ...]   Tables containing variants to annotate. Can be a list of variant IDs or a table with columns containing variant ID, chromosome, position, ref and alt alleles

Options:
  -h, --help
  -s, --sep=SEP              Column separator (default: tab)
  -e, --header               Input file has header
  -b, --build=BUILD          Genome build Possible values: [GRCh37, GRCh38]
  -o, --out=OUT              Output folder or prefix. Output structure is: rsid2pos_{target_build}-{infileprefix}.tsv. If not provided output to stdout
  -i, --varid_column=VARID_COLUMN
                             0-based index of column containing variantID. If set all others columns will be ignored and information are parsed from id values (default: -1)
  -e, --parse_exp=PARSE_EXP  Pattern for parsing chrom, pos, ref, alt from variant ID (default: ([0-9XYM]+):([0-9]+):([ACTG]+):([ACTG]+))
  -c, --chrom=CHROM          Only process data for specific chromosomes. Comma-sep list accepted or -1 for all chromosomes (default: -1)
  -x, --chrom_column=CHROM_COLUMN
                             0-based index of column containing chromosome in the input (default: -1)
  -p, --pos_column=POS_COLUMN
                             0-based index of column containing position in the input (default: -1)
  -r, --ref_column=REF_COLUMN
                             0-based index of column containing REF allele in the input (default: -1)
  -a, --alt_column=ALT_COLUMN
                             0-based index of column containing ALT allele in the input (default: -1)
  -d, --map_dir=MAP_DIR      Directory containing binaries files generated by make_bin
  -n, --no_missing           Do not include in output rsIDs that can't be liftovered to the target build

Performance considerations

How chromosome information is loaded

Then chromosome information is present in the input, the tools only load information for the needed chromosome(s). This makes the process faster and more efficient. Chromosome is mandatory for annotate, but it is recommended to provide chromosome information also when running rsid2pos. Othwerwise, the tool will load all the binary files, this takes time and memory (up to 64GB for the GRCh38 dbSNP151).

Parallelize by chromosome

The --chrom argument makes easy to parallelize the process by chromosome. You can use the following command to parallelize the process by chromosome. This accepts a comma-separated list and only variants located on the chromosome(s) provided will be processed. This is convenient to automatically split execution across chromosomes and then merge the results.

For example:

CHROM="1"
BUILD="GRCh38"

rsid_tools rsid2pos -t ${BUILD} -o /results/${CHROM} -c $CHROM -d /path/to/binaries/ /path/to/input.tsv

For rsid2pos this approach makes sense only when the input file also contains chromosome information. Otherwise, the tool always need to load binary data for all chromosome to be sure your variant can be found so the speed up is minimal.

Make the binary files

To be able to perform conversions, you first need to prepare binary representation of you dnbSNP tables. This can be done using the make_bin command. This will convert any input table containing SNP information into 2 binary files, used for genomic coordinate annotation or rsID annotation. These files are named {BUILD}_{CHROM}.rsid2pos.bin and {BUILD}_{CHROM}.hash2rsid.bin and must be all stored in the same folder. Ideally, you want to have one such folder for each dbSNP release.

These dbSNP tables must be tab-separated, eventually gzipped, and containing the following information: chromosome, rsID, position, reference allele, alternative allele. You can set the order of columns using the corresponding argument. The header, if present, must have # as the first character.

Note that each input file must contain SNPs from a single chromosome and only one file per chromosome is allowed at the moment.

The resulting

You can generate suitable tables from the dbSNP VCF files using the following command:

BUILD="GRCh37"
CHROM="1"
DBSNP_VERSION="151"

echo -e "#CHROM\tRSID\tPOS\tREF\tALT" > ${BUILD}_dbSNP${DBSNP_VERSION}.chr${CHROM}.tsv
bcftools query -r $CHROM -f "%CHROM\t%ID\t%POS\t%REF\t%ALT\n" ${BUILD}-All.vcf.gz | sed 's/rs//g' | sort -k1,1V -k2,2n >> ${BUILD}_dbSNP${DBSNP_VERSION}.chr${CHROM}.tsv

bgzip ${BUILD}_dbSNP${DBSNP_VERSION}.chr${CHROM}.tsv

The usage for the make_bin command is as follows:

Usage:
  make_bin [options] [intables ...]

Arguments:
  [intables ...]   dbSNP tables to convert to binary

Options:
  -h, --help
  -b, --build=BUILD          Genome build Possible values: [GRCh37, GRCh38]
  -v, --dbsnp_version=DBSNP_VERSION
                             dbSNP version
  -o, --out=OUT              Output folder (default: ./)
  -i, --rsid_column=RSID_COLUMN
                             0-based index of column containing rsIDs in the input (default: 1)
  -x, --chrom_column=CHROM_COLUMN
                             0-based index of column containing chromosome in the input (default: 0)
  -p, --pos_column=POS_COLUMN
                             0-based index of column containing position in the input (default: 2)
  -r, --ref_column=REF_COLUMN
                             0-based index of column containing REF allele in the input (default: 3)
  -a, --alt_column=ALT_COLUMN
                             0-based index of column containing ALT allele(s) in the input. If multiple ALTs they must be comma-separated (default: 4)

About

Small utilities to play with dbSNP rsID: annotate rsID for a var or get position for an rsID

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published