First of all, make sure to clone the repository with all the submodules
git clone git@github.com:tGautot/SVC.git --recursive
If you already cloned the repo and forgot the --recursive flag, you can simply run
git submodule update --init
Some libraries then need to be installed. First you'll need vcflib-dev. On Ubuntu, use the command below; otherwise see https://github.com/vcflib/vcflib#install
sudo apt-get install libvcflib-dev
For encryption, this project uses libsodium. Installation details are available at https://doc.libsodium.org/installation
You then also need to install the -dev library:
sudo apt-get install libsodium-dev
The project also uses OpenMP:
sudo apt-get install libomp-dev
In the external folder, there should be a jbigkit library. A few changes need to be made to it to make it linkable with C++.
First, in jbigkit/Makefile, change the compiler from gcc to g++ (in version 2.1 of the library, it is the first uncommented line of the Makefile).
Then do the same in jbigkit/libjbig/Makefile.
Once this is done, open the jbigkit/libjbig/jbig.h file. At the end of this file you should find all the function prototypes. Enclose them with the following:
#ifdef __cplusplus
extern "C"{
#endif
// ... All function prototypes should be here
#ifdef __cplusplus
}
#endif
The external folder should also contain the zlib library. Go into its folder and run:
./configure
make test
make install
Now everything should be set up for you to compile and use the code. In the root folder, run:
make svc
make keyhandler
There are 2 main courses of action:
- Compression + Encryption: going from a .vcf file to a .svc file (the extension used for the output of this program)
- Decompression + Decryption: going the other way around
For compression and encryption, use the svc_exec binary created by make svc with the following (required) arguments:
-f filename # Path to the VCF file to compress
-o filename # Result file, containing the compressed and encrypted vcf file (give it the .svc extension :) )
-j n_thread # Number of processing threads; in addition, there is always 1 input thread and 1 output thread
# Note that at any time there can be more than n_thread+2 threads alive, since the processing/input threads
# might spawn sub-threads (with OpenMP) to further parallelize some tasks
-i filename # Path to the index file, more on this later
-k filename # Key file, where all keys used for encryption will be written
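As an illustration of the arguments listed above, here is a minimal Python sketch that assembles a compression command line. The helper name build_compress_command, the example file names, and the assumption that svc_exec sits in the current directory are all hypothetical; only the flags themselves come from this README.

```python
import shlex


def build_compress_command(vcf_path, svc_path, n_threads, index_path, key_path,
                           executable="./svc_exec"):
    """Return the argument list for a compression + encryption run.

    The executable location is an assumption; adjust it to wherever
    `make svc` placed the binary on your machine.
    """
    return [
        executable,
        "-f", str(vcf_path),    # VCF file to compress
        "-o", str(svc_path),    # resulting .svc file
        "-j", str(n_threads),   # number of processing threads
        "-i", str(index_path),  # index file describing the sections
        "-k", str(key_path),    # key file that will be written
    ]


# Hypothetical file names, for illustration only.
cmd = build_compress_command("input.vcf", "output.svc", 4, "index.csv", "keys.bin")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the binary has been built.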
To give the user more control over how the VCF file is handled, encrypted and sectioned, SVC uses index files. These are simple CSV files that define blocks of contiguous genotypes (e.g. genes). They can contain any number of columns but require at least these 3: chrom,start_pos,end_pos
For example, a simple index file could look like the one below. Make sure that the entries are sorted, otherwise some might be skipped:
ID,chrom,start_pos,end_pos,Metadata
0,20,1000,5000,Chr20 Gene1 special
1,22,111222,333444,Chr22 Bad Gene
2,22,555666,777888,Chr22 Good Gene
This index file defines 3 sections (one for each line). Each section will be encrypted independently and use only one encryption key. This makes later retrieval of a section easier and more secure, since only a single key needs to be handed off to someone for them to be able to decrypt that section.
As we'll see in the next chapter, every section is assigned an id (in this example it happens to equal the value in the ID column, although it is independent of that column), which can be used alongside the keyhandler binary to retrieve keys from the key file.
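Since unsorted entries might be skipped, it can help to check an index file before running the compression. The sketch below is not part of SVC; it assumes that "sorted" means ascending by chromosome and then by start position, and that chromosomes are written as integers, as in the example above. Neither assumption is confirmed by this README.

```python
import csv
import io

REQUIRED_COLUMNS = ("chrom", "start_pos", "end_pos")


def check_index(text):
    """Return a list of problems found in an index file given as text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    problems = []
    previous = None
    # Data rows start on line 2, right after the header line.
    for line_no, row in enumerate(reader, start=2):
        try:
            key = (int(row["chrom"]), int(row["start_pos"]))
            end = int(row["end_pos"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-integer chrom or position")
            continue
        if end < key[1]:
            problems.append(f"line {line_no}: end_pos is before start_pos")
        # Assumed ordering: ascending by (chrom, start_pos).
        if previous is not None and key < previous:
            problems.append(f"line {line_no}: entry is out of order")
        previous = key
    return problems


example = """ID,chrom,start_pos,end_pos,Metadata
0,20,1000,5000,Chr20 Gene1 special
1,22,111222,333444,Chr22 Bad Gene
2,22,555666,777888,Chr22 Good Gene
"""
print(check_index(example))
```

An empty list means no problem was detected under these assumptions.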
To decompress and decrypt, the same executable (svc_exec) is used, but the parameters are different and change meaning.
[REQ] -r # Tells the executable to work in decompression/decryption mode
[REQ] -f filename # Path to the SVC file to decompress
[REQ] -o filename # Result VCF file
[OPT] -c chromstt:chromend # Interval of chromosomes to query
[OPT] -p posstt:posend # Interval of positions to query
[REQ] -s filename # File containing the comma-separated names of the samples you want kept in the result vcf file
[REQ] -k filename # Key file, should contain at least the keys needed to decrypt the queried section
The -r flag is a required parameter and tells the executable to work in decompression/decryption mode instead of the default one.
The -c and -p options together define the queried region. -c gives the start and end chromosomes (separated by a colon) and -p gives the start position (on the start chromosome) and the end position (on the end chromosome), also separated by a colon. For example, if you give -c 1:3 -p 100:100, the executable will attempt to decompress all the genotypes beyond position 100 of chromosome 1, the whole of chromosome 2, and then all genotypes before position 100 on chromosome 3. It most often makes sense to only decrypt part of a single chromosome at once, for example -c 22:22 -p 10000:20000
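The region semantics described above amount to comparing (chromosome, position) pairs in order. The Python sketch below illustrates that reading; it is not the SVC implementation. The function names are hypothetical, and treating both bounds as inclusive is an assumption, since this README does not say whether the boundary positions themselves are included.

```python
def parse_interval(text):
    """Split a 'start:end' argument, as used by -c and -p, into two integers."""
    start, end = text.split(":")
    return int(start), int(end)


def in_region(chrom, pos, c_arg, p_arg):
    """Tell whether (chrom, pos) falls inside the region given by -c and -p.

    Tuples compare element by element, so every position of a chromosome
    strictly between the start and end chromosomes is inside the region.
    Inclusive bounds are an assumption.
    """
    chrom_start, chrom_end = parse_interval(c_arg)
    pos_start, pos_end = parse_interval(p_arg)
    return (chrom_start, pos_start) <= (chrom, pos) <= (chrom_end, pos_end)


# With -c 1:3 -p 100:100:
print(in_region(1, 50, "1:3", "100:100"))    # before position 100 on chromosome 1
print(in_region(2, 999999, "1:3", "100:100"))  # anywhere on chromosome 2
print(in_region(3, 150, "1:3", "100:100"))   # after position 100 on chromosome 3
```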
The file given to -s should contain a single line (with a newline at the end) with all the sample names you want in the output VCF file (for example, if you only want three samples it could be: NA18956,NA18957,NA18959). !!!NOTE!!! This is a quality-of-life feature, NOT A SECURITY ONE: the samples are filtered in the frontend, but all the genotypes of all the samples get decrypted and decompressed beforehand.
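A samples file in the format described above can be generated with a few lines of Python. This is only a sketch: the helper names and the output file name samples.txt are hypothetical, and rejecting names that contain a comma or a line break is a precaution added here, not a documented rule.

```python
import os
import tempfile


def format_samples_line(sample_names):
    """Return the single comma-separated line, ending with a newline."""
    for name in sample_names:
        if "," in name or "\n" in name:
            raise ValueError(f"sample name cannot be stored safely: {name!r}")
    return ",".join(sample_names) + "\n"


def write_samples_file(path, sample_names):
    """Write the samples file expected by the -s option."""
    with open(path, "w", newline="\n") as handle:
        handle.write(format_samples_line(sample_names))


# Write to a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as folder:
    samples_path = os.path.join(folder, "samples.txt")
    write_samples_file(samples_path, ["NA18956", "NA18957", "NA18959"])
    with open(samples_path) as handle:
        written = handle.read()
print(repr(written))
```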
The keyfile you give determines what you'll be able to decrypt. If you give the file exactly as it was created during encryption, you will be able to decrypt every single section, and thus the whole file.
With the keyhandler executable you can filter keys from a keyfile and thus limit what sections can be decoded by a person obtaining this new limited keyfile.
The usage is simple: ./svc_keys -f inkeyfile -o outkeyfile
It then reads from stdin the ids of the sections whose keys you want. You can either type them in your terminal (end with an empty line) or write them to a file and pipe it to the utility.
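The piping approach can also be scripted. The sketch below prepares the text to send on stdin and shows one way to pass it to the utility from Python. It assumes one id per line, which this README does not specify, and it keeps the empty line mentioned above as a terminator. The function names are hypothetical; the command line is the one shown above.

```python
import subprocess


def format_section_ids(section_ids):
    """Return the stdin text: one id per line (assumed), then an empty line."""
    return "".join(f"{int(section_id)}\n" for section_id in section_ids) + "\n"


def filter_keys(in_keyfile, out_keyfile, section_ids, executable="./svc_keys"):
    """Run the key filtering utility, feeding the section ids on stdin.

    The executable path is an assumption; use the name of the binary
    produced by your build.
    """
    subprocess.run(
        [executable, "-f", in_keyfile, "-o", out_keyfile],
        input=format_section_ids(section_ids),
        text=True,
        check=True,
    )


# Only the stdin text is built here; filter_keys is not called because
# it needs the compiled utility and a real key file.
stdin_text = format_section_ids([0, 2])
print(repr(stdin_text))
```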