Secure VCF Compressor

Installation

First of all, make sure to clone the repository with all its submodules:

git clone git@github.com:tGautot/SVC.git --recursive

If you already cloned the repo and forgot --recursive, you can simply run:

git submodule update --init

Next, install the required libraries. The first is vcflib. On Ubuntu, install the libvcflib-dev package as shown below; on other systems, follow https://github.com/vcflib/vcflib#install

sudo apt-get install libvcflib-dev

For encryption, this project uses libsodium; installation details are at https://doc.libsodium.org/installation. You also need to install the -dev package:

sudo apt-get install libsodium-dev

The project also uses OpenMP:

sudo apt-get install libomp-dev

The external folder should contain the jbigkit library. A few changes are needed to make it linkable with C++. First, in jbigkit/Makefile, change the compiler from gcc to g++ (in version 2.1 of the library, this is the first uncommented line of the Makefile). Then make the same change in jbigkit/libjbig/Makefile.

Once this is done, head to the jbigkit/libjbig/jbig.h file. At the end of this file you should find all the function prototypes declared. Enclose those with the following:

#ifdef __cplusplus
extern "C"{
#endif 

// ... All function prototypes should be here

#ifdef __cplusplus
}
#endif

The external folder should also contain the zlib library. Go into its folder and run:

./configure
make test
make install

Everything should now be set up to compile and use the code. In the root folder, run:

make svc
make keyhandler

Usage

There are 2 main courses of action:
- Compression + Encryption: going from a .vcf file to a .svc file (the extension used for the output of this program)
- Decompression + Decryption: going the other way around

Compressing and Encrypting VCFs

For this step, use the svc_exec binary created by make svc with the following (required) arguments:

-f filename # Path to the VCF file to compress
-o filename # Result file, containing the compressed and encrypted vcf file (give it the .svc extension :) )
-j n_thread # Number of processing threads; there is always 1 additional input thread and 1 additional output thread
            # Note that at any time more than n_thread+2 threads can be alive, since the processing and input threads
            # might spawn sub-threads (with OpenMP) to further parallelize some tasks
-i filename # Path to the index file, more on this later
-k filename # Key file, where all keys used for encryption will be written

Index file

To give the user more control over how the VCF file is handled, encrypted and sectioned, SVC uses index files. These are simple CSV files that define blocks of contiguous genotypes (e.g. genes). They can contain any number of columns but require at least these 3: chrom,start_pos,end_pos

For example, a simple index file could look like the one below. Make sure that the entries are sorted, otherwise some might be skipped:

ID,chrom,start_pos,end_pos,Metadata
0,20,1000,5000,Chr20 Gene1 special
1,22,111222,333444,Chr22 Bad Gene
2,22,555666,777888,Chr22 Good Gene

This index file defines 3 sections (one for each line). Each section is encrypted independently and uses a single encryption key. This makes later retrieval of a section easier and more secure, since only one key needs to be handed to someone for them to be able to decrypt that section. As described in the Keyhandler section, every section is assigned an id (in this example it equals the value in the ID column, although it is assigned independently of that column), which can be used alongside the keyhandler binary to retrieve keys from the key file.
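Since unsorted entries might be skipped, it can help to check an index file before compressing. The following is a minimal Python sketch and not part of SVC: the function name check_index is hypothetical, and it assumes numeric chromosome names and that "sorted" means ascending by chromosome and then by start position, which should be verified against your data.

```python
import csv
import io

# Columns that the README states every index file must contain.
REQUIRED_COLUMNS = ("chrom", "start_pos", "end_pos")


def check_index(text):
    """Return a list of problems found in an index file given as a string.

    Assumption (not confirmed by SVC): entries must be in ascending order
    of (chrom, start_pos), with numeric chromosome names.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    problems = []
    previous = None
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        key = (int(row["chrom"]), int(row["start_pos"]))
        if int(row["end_pos"]) < int(row["start_pos"]):
            problems.append(f"line {line_no}: end_pos is before start_pos")
        if previous is not None and key < previous:
            problems.append(f"line {line_no}: entry is out of order")
        previous = key
    return problems


example = """ID,chrom,start_pos,end_pos,Metadata
0,20,1000,5000,Chr20 Gene1 special
1,22,111222,333444,Chr22 Bad Gene
2,22,555666,777888,Chr22 Good Gene
"""
print(check_index(example))  # prints [] because the example is well formed
```

Extra columns such as ID and Metadata are ignored by the check, matching the rule that only chrom, start_pos and end_pos are required.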

Decompressing and Decrypting SVCs

To do this, the same executable is used (svc_exec) but the parameters are different and change meaning.

[REQ] -r                   # Tells the executable to work in decompression/decryption mode 
[REQ] -f filename          # Path to the SVC file to decompress
[REQ] -o filename          # Result VCF file
[OPT] -c chromstt:chromend # Interval of chromosomes to query
[OPT] -p posstt:posend     # Interval of positions to query 
[REQ] -s filename          # File containing the comma-separated names of the samples you want kept in the result vcf file
[REQ] -k filename          # Key file, should contain at least the keys needed to decrypt the queried sections

The -r flag is required and tells the executable to work in decompression/decryption mode instead of the default one.

The -c and -p options together define the queried region. -c gives the start and end chromosomes (separated by a colon), and -p gives the start position (on the start chromosome) and the end position (on the end chromosome), also separated by a colon. For example, if you give -c 1:3 -p 100:100, the executable will attempt to decompress all the genotypes beyond position 100 of chromosome 1, the whole of chromosome 2, and then all genotypes before position 100 on chromosome 3. It most often makes sense to decrypt only part of a single chromosome at once, for example -c 22:22 -p 10000:20000.
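The region semantics above can be sketched as a small membership test. This is an illustration in Python, not SVC's actual implementation: the function in_query_region is hypothetical, chromosomes are assumed to be numeric, and treating the boundary positions as inclusive is an assumption the text does not confirm.

```python
def in_query_region(chrom, pos, chrom_range, pos_range):
    """Tell whether (chrom, pos) falls in the region given by -c and -p.

    chrom_range is (start chromosome, end chromosome) as passed to -c.
    pos_range is (start position, end position) as passed to -p.
    Assumption: both boundary positions are inclusive.
    """
    chrom_start, chrom_end = chrom_range
    pos_start, pos_end = pos_range
    if chrom < chrom_start or chrom > chrom_end:
        return False  # outside the chromosome interval
    if chrom == chrom_start and pos < pos_start:
        return False  # before the start position on the start chromosome
    if chrom == chrom_end and pos > pos_end:
        return False  # past the end position on the end chromosome
    return True


# -c 1:3 -p 100:100
print(in_query_region(1, 50, (1, 3), (100, 100)))   # False
print(in_query_region(2, 50, (1, 3), (100, 100)))   # True
print(in_query_region(3, 150, (1, 3), (100, 100)))  # False
```

Note that the start position only constrains the start chromosome and the end position only constrains the end chromosome, which is why every position on chromosome 2 is included in the example.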

The file given to -s should contain a single line (with a newline at the end) listing all the sample names you want in the output VCF file (for example, if you only want three samples it could be: NA18956,NA18957,NA18959). NOTE: this is a quality-of-life feature, NOT A SECURITY ONE. The samples are filtered in the frontend, but all the genotypes of all the samples get decrypted and decompressed beforehand.
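As an illustration of the expected layout, here is a minimal Python sketch that writes such a sample file. The helper write_sample_file and the file name samples.txt are hypothetical; the only format facts taken from the text are the comma-separated names on a single line and the trailing newline.

```python
import os
import tempfile


def write_sample_file(path, sample_names):
    """Write sample names as one comma-separated line ending with a newline."""
    for name in sample_names:
        if "," in name or "\n" in name:
            raise ValueError(f"sample name cannot contain ',' or a newline: {name!r}")
    with open(path, "w", newline="\n") as handle:
        handle.write(",".join(sample_names) + "\n")


# Demonstration in a temporary directory so no existing file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "samples.txt")
    write_sample_file(sample_path, ["NA18956", "NA18957", "NA18959"])
    with open(sample_path) as handle:
        written = handle.read()

print(repr(written))  # prints 'NA18956,NA18957,NA18959\n'
```

The resulting path is what would be passed to -s.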

The key file you give determines what you'll be able to decrypt. If you give the key file exactly as it was created when encrypting, you will be able to decrypt every single section, and thus the whole file.

Keyhandler

With the keyhandler executable you can filter keys from a key file and thus limit which sections can be decoded by a person who obtains the new, limited key file. The usage is simple:

./svc_keys -f inkeyfile -o outkeyfile

It then waits on stdin for the ids of the sections whose keys you want. You can either type them in your terminal (end with an empty line) or write them to a file and pipe it to the utility.
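For scripted use, the piping approach can be sketched in Python as below. This only prepares the command and the stdin text and does not run anything. The helper keyhandler_invocation and the key file names are hypothetical, and writing one id per line followed by an empty line is an assumption based on the terminal instructions above, so check it against the actual behaviour of svc_keys.

```python
def keyhandler_invocation(in_keyfile, out_keyfile, section_ids):
    """Build the svc_keys command line and the text to send on stdin.

    Assumption: ids are read one per line, and a final empty line
    ends the input, as when typing them in a terminal.
    """
    argv = ["./svc_keys", "-f", in_keyfile, "-o", out_keyfile]
    stdin_text = "".join(f"{section_id}\n" for section_id in section_ids) + "\n"
    return argv, stdin_text


# Keep only the keys of sections 0 and 2 (hypothetical file names).
argv, stdin_text = keyhandler_invocation("all.keys", "shared.keys", [0, 2])
print(argv)
print(repr(stdin_text))
```

The pair could then be handed to the standard library, for example subprocess.run(argv, input=stdin_text, text=True, check=True), from the folder that contains the svc_keys binary.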
