Commit e607396

compileup bgt-server in makefile optionally
1 parent 65cac43 commit e607396

5 files changed, +61 -46 lines changed

Makefile (+3)

@@ -22,6 +22,9 @@ libbgt.a:$(OBJS)
 bgt:libbgt.a main.o import.o view.o
 	$(CC) main.o import.o view.o -o $@ $(LIBS)
 
+bgt-server:bgt-server.go libbgt.a
+	go build bgt-server.go
+
 pbfview:pbfview.o pbwt.o
 	$(CC) $^ -o $@
 

tex/Makefile (+1 -1)

@@ -15,4 +15,4 @@ bgt.pdf:bgt.tex bgt.bib
 	pdflatex bgt; bibtex bgt; pdflatex bgt; pdflatex bgt;
 
 clean:
-	rm -fr *.toc *.aux *.bbl *.blg *.idx *.log *.out *~
+	rm -fr *.toc *.aux *.bbl *.blg *.idx *.log *.out *~ bgt.pdf

tex/bgt.bib (+8)

@@ -29,3 +29,11 @@ @article{Stade:2014ty
 Title = {{GrabBlur}--a framework to facilitate the secure exchange of whole-exome and -genome {SNV} data using {VCF} files},
 Volume = {15 Suppl 4},
 Year = {2014}}
+
+@article{Ruan:2015ab,
+author = {Layer, Ryan M and others},
+title = {Efficient compression and analysis of large genetic variation datasets},
+year = {2015},
+doi = {10.1101/018259},
+publisher = {Cold Spring Harbor Labs Journals},
+journal = {bioRxiv}}

tex/bgt.tex (+44 -40)

@@ -27,8 +27,9 @@ \section{Summary:} BGT is a compact format, a fast command line tool and a
 simple web application for efficiently and conveniently querying whole-genome
 genotypes and frequencies across tens to hundreds of thousands of samples.
 On real data, it compresses the genotypes of 32,488 samples across 39.2
-million sites down to a 7.4GB database and processes a couple of hundred
-million genotypes per CPU second.
+million SNPs down to a 7.4GB database and processes a couple of hundred
+million genotypes per CPU second. The high performance enables real-time
+responses to complex queries.
 
 \section{Availability and implementation:} https://github.com/lh3/bgt
 
@@ -40,7 +41,7 @@ \section{Introduction}
 VCF/BCF~\citep{Danecek:2011qy} is the primary format for storing and analyzing
 genotypes of multiple samples. It however has a few issues. Firstly, as a
 streaming format, VCF compresses all types of information together. Retrieving
-site annotations or the genotypes of a few samples usually requires to decoding
+site annotations or the genotypes of a few samples usually requires to decode
 the genotypes of all samples, which is unnecessarily expensive. Secondly, VCF
 does not take advantage of linkage disequillibrium (LD), while using this
 information can dramatically improve compression ratio~\citep{Durbin:2014yq}.
@@ -49,8 +50,11 @@ \section{Introduction}
 ambiguity complicates annotations, query of alleles and integration of multiple
 data sets. At last, most existing VCF-based tool chains do not support
 expressive data query. We frequently need to write scripts for advanced
-queries, which costs both development and processing time. BGT is designed to
-overcome or alleviate these issues.
+queries, which costs both development and processing time.
+GQT~\citep{Ruan:2015ab} attempts to solve some of these issues. While it is
+very fast for selecting a subset of samples and for traversing all sites, it
+discards phasing, is inefficient for region query and is not compressed well.
+These observations motivated us to develop BGT.
 
 \begin{methods}
 \section{Methods}
@@ -110,27 +114,28 @@ \subsubsection{PBWT overview}
 $S_{k-1}$, computing $\phi_k\to S_k\to A_k$ derives $A_k$ from $B_k$.
 
 When there are strong correlations between adjacent rows, which is true for
-haplotype data due to linkage disequillibrium, $0$s and $1$s tend to form long
+haplotype data due to LD, $0$s and $1$s tend to form long
 runs in $B_k$. This usually makes $B_k$ much more compressible than $A_k$ under
-run-length encoding. For our test data set, 32 thousand samples can be
-compressed to less than 200 bytes in average.
+run-length encoding. For our test data set, 32 thousand genotypes in a row can
+be compressed to less than 200 bytes in average.
 
 \subsection{Query with BGT}
 
 \subsubsection{Flat Metadata Format (FMF)}
 
-BGT introduces a new but simple text format, FMF, to manage meta data. FMF is
-TAB-delimited with the first column showing the row name and following columns
-giving typed key-value pairs. An example looks like:
+BGT introduces a new but simple text format, FMF, to manage meta data. FMF has
+a similar data model to wide-column stores. It is TAB-delimited with the first
+column showing the row name and following columns giving typed key-value pairs.
+An example looks like:
 \begin{center}
 \begin{verbatim}
 sample1 gender:Z:M height:f:1.73 foo:i:10
 sample2 gender:Z:F height:f:1.64 bar:i:20
 \end{verbatim}
 \end{center}
 where rows can be retrieved by an arbitrary expression such as
-``height$>$1.65''. It has a similar data model to wide-column databases. BGT
-uses FMF to keep and query sample phenotypes and variant annotations.
+``height$>$1.65''. BGT uses FMF to keep and query sample phenotypes and variant
+annotations.
 
 \subsubsection{Query genotypes}
 
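Note on the FMF paragraph in the hunk above (not part of the commit). The following is a minimal Python sketch of how a TAB-delimited FMF row with typed key-value pairs could be parsed and filtered by a condition such as height>1.65. It is not BGT's implementation. The function names are hypothetical, and reading the type codes `i`, `f` and `Z` as integer, float and string is an assumption inferred from the two example rows.

```python
def parse_fmf_line(line):
    """Parse one FMF row into (row_name, {key: typed_value})."""
    fields = line.rstrip("\n").split("\t")
    name, attrs = fields[0], {}
    for field in fields[1:]:
        # Each field looks like key:type:value; split at most twice so the
        # value itself may contain ':'.
        key, type_code, raw = field.split(":", 2)
        if type_code == "i":      # assumed: integer
            attrs[key] = int(raw)
        elif type_code == "f":    # assumed: floating point
            attrs[key] = float(raw)
        else:                     # assumed: 'Z' is a string; others kept as text
            attrs[key] = raw
    return name, attrs


def select_rows(lines, predicate):
    """Return the names of rows whose attributes satisfy the predicate."""
    selected = []
    for line in lines:
        name, attrs = parse_fmf_line(line)
        if predicate(attrs):
            selected.append(name)
    return selected


rows = [
    "sample1\tgender:Z:M\theight:f:1.73\tfoo:i:10",
    "sample2\tgender:Z:F\theight:f:1.64\tbar:i:20",
]
# Rows lacking the attribute are treated as non-matching (an assumption).
tall = select_rows(rows, lambda a: "height" in a and a["height"] > 1.65)
print(tall)
```

Because each row carries its own keys (`foo` on one row, `bar` on the other), a per-row dictionary is a natural fit for the wide-column data model the text mentions.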
@@ -155,22 +160,23 @@ \subsection{BGT server}
 BGT comes with a standalone web server frontend implemented in the Go
 programming language. The server has a similar interface to the command line
 tool, but with additional consideration of sample anonymity. With BGT,
-each sample has an attribute minimal group size or MGS. On query, the server
+each sample has an attribute `minimal group size' or MGS. On query, the server
 refuses to create a sample group if the size of this group is smaller than the
 MGS of one sample in this group. In particular, if a sample has MGS larger than
 one, users cannot access its sample name and individual genotypes, but can
-retrieve allele counts computed together with other samples.
+retrieve allele counts computed together with other samples. This prevents
+users to access data at the level of a single sample.
 
 \end{methods}
 
 \section{Results}
 
 We generated the BGT database for the first release of Haplotype Reference
-Consortium (HRC; http://bit.ly/HRC-org). The input is a BCF containing 32,488 samples across 39.2
-million sites on autosomes. The BGT file size is 7.4GB, 11\% of the
-genotype-only BCF. Decoding the genotypes of all
-samples across 142k sites in a 10Mbp region takes 11 CPU seconds, which amounts
-to decoding 420 million genotypes per second. This speed is even faster than
+Consortium (HRC; http://bit.ly/HRC-org). The input is a BCF containing 32,488
+samples across 39.2 million SNPs on autosomes. The BGT file size is 7.4GB, 11\%
+of the genotype-only BCF, or 8\% of GQT. Decoding the genotypes of all samples
+across 142k sites in a 10Mbp region takes 11 CPU seconds, which amounts to
+decoding 420 million genotypes per second. This speed is even faster than
 computing allele counts and outputting VCF.
 
 We use the following command line to demonstrate the query syntax of BGT:
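Note on the MGS rule in the hunk above (not part of the commit). The rule as worded, refusing a group that is smaller than the MGS of any of its members, can be sketched in Python as below. This is an illustration of the stated policy only, not the Go server's code. The function name, the default MGS of 1 for unlisted samples and the refusal of empty groups are all assumptions.

```python
def can_create_group(group, mgs):
    """Decide whether a sample group may be created under the MGS rule.

    group: list of sample names requested by the user.
    mgs:   dict mapping sample name -> minimal group size.
    """
    if not group:
        return False  # assumption: an empty group is refused
    # The group is allowed only if it is at least as large as the largest
    # MGS among its members; samples without an entry default to 1 (assumed).
    largest_mgs = max(mgs.get(sample, 1) for sample in group)
    return len(group) >= largest_mgs


mgs = {"a": 1, "b": 3, "c": 3}
print(can_create_group(["a"], mgs))            # "a" may be queried alone
print(can_create_group(["b"], mgs))            # "b" alone is too small a group
print(can_create_group(["a", "b", "c"], mgs))  # three samples satisfy MGS 3
```

Under this rule a sample with MGS larger than one can never form a group by itself, which matches the statement that its individual genotypes stay inaccessible while pooled allele counts remain available.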
@@ -182,59 +188,57 @@ \section{Results}
 HRC-r1.bgt
 \end{verbatim}
 \end{center}
-It finds chr11 coding variants annotated in `var.fmf.gz' that have $\ge$0.1\%
-frequency in the IBD data set (http://www.ibdresearch.co.uk) but absent from
+It finds BRCA1 variants annotated in `var.fmf.gz' that have $\ge$0.1\%
+frequency in both the IBD data set (http://www.ibdresearch.co.uk) and
 1000 Genomes~\citep{1000-Genomes-Project-Consortium:2012aa}. In this command line, {\tt -G} disables the output of genotypes.
 Option {\tt -a} selects variants with the `gene' attribute equal to `BRCA1'
 according to the variant database specified with {\tt -d}. This condition
 is indepenent of sample genotypes. Each option {\tt -s} selects a group of
-samples, again independent of sample genotypes. For the \char35-th sample
+samples based on phenotypes. For the \char35-th sample
 group/{\tt -s}, BGT counts the total number of called alleles and the number of
 non-reference alleles and writes them to the {\tt AN\char35} and {\tt
 AC\char35} aggregate variables, respectively. Option {\tt -f} then use these
 aggregate variables to filter output.
 
-This command line takes 12 CPU seconds with most of time spent on reading
+The command line above takes 12 CPU seconds with most of time spent on reading
 through the variant annotation file to find matching alleles. The BGT server
 reads the entire file into memory to alleviate the overhead, but a better
 solution would be to set up a proper database for variant annotations.
 
 \section{Discussions}
 
 Given a multi-sample VCF, most BGT functionalities can be achieved with small
-scripts. However, as a command line tool, BGT is still an advance in several
-ways. Firstly, it saves development time. Extracting information from multiple
-files can be done with a command line instead of a script. Secondly,
+scripts, but as a command line tool, BGT still has a few advantages. Firstly, it
+saves development time. Extracting information from multiple files can be done
+with a command line instead of a script. Secondly,
 BGT saves processing time. With high-performance C code at the core, BGT is
 much faster than processing VCF in a scripting language such as Perl or Python.
 For example, deriving allele counts in a 10Mbp region for the HRC data takes 30
 seconds with BGT, but doing the same with a Perl script takes 40 minutes, a
 80-fold difference. Thirdly, the design of one non-reference allele per record
-and the separation of variant annotation from genotype data helps more scalable
-data processing. These design choices make BGT merge is much simpler and faster
-than generic VCF merge. This enables efficient query across multiple BGT
-databases which is not practical with VCF.
-
-The BGT server tries to solve a bigger problem: data sharing. A key feature of
-the server is to make the user-requested summary of genotypes available while
-keeping samples unidentifiable~\citep{Stade:2014ty}. Instead of always
+makes BGT merge much simpler and faster than generic VCF merge. This enables
+efficient query across multiple BGT databases which is not practical with VCF.
+
+The BGT server tries to solve a bigger problem: data sharing. Instead of always
 delivering full data in VCF, projects could have a new option to serve data
-publicly with BGT, letting users select the summary statistics of interest
-without breaching privacy.
+publicly with the BGT server, letting users select the summary statistics of interest
+on the fly while keeping samples unidentifiable. This is an improvement to
+\citet{Stade:2014ty} which only provide precomputed summary.
 
 % summary data is smaller, but genotypes in PBWT is not large, either.
 
 We acknowledge that our MGS-based data sharing policy might have oversimplified
 real scenarios, but we believe this direction, with proper improvements and
 more importantly the approval of ethical review boards, will be more open,
 convenient, efficient and secure than our current
-share-everything-with-permission model.
+share-everything-under-permission model.
 
 \section*{Acknowledgement}
 I am grateful to HRC for granting the permission to use the data for evaluating
-the performance of BGT and thank the Global Alliance for the helpful
-suggestions.
+the performance of BGT and thank the Global Alliance Data Working Group for the
+helpful suggestions.
 \paragraph{Funding\textcolon} NHGRI U54HG003037; NIH GM100233
 
 \bibliography{bgt}
+
 \end{document}
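Note on the AN#/AC# aggregates described in the tex/bgt.tex hunk above (not part of the commit). The per-group counting and the frequency filter can be illustrated with the Python sketch below. It mirrors the prose only and does not reproduce BGT's command-line syntax or internals. The allele encoding (0 for reference, 1 for non-reference, None for an uncalled allele) and both function names are assumptions made for the example.

```python
def allele_counts(alleles):
    """Return (AN, AC) for one sample group at one variant.

    alleles: iterable of allele codes for the group, where 0 is the reference
    allele, 1 is the non-reference allele and None is uncalled (assumed coding).
    """
    called = [a for a in alleles if a is not None]
    an = len(called)                         # total number of called alleles
    ac = sum(1 for a in called if a != 0)    # number of non-reference alleles
    return an, ac


def passes_filter(groups, min_freq=0.001):
    """Keep a variant only if AC/AN >= min_freq in every sample group."""
    for alleles in groups:
        an, ac = allele_counts(alleles)
        if an == 0 or ac / an < min_freq:
            return False
    return True


group1 = [0, 1, None, 0]   # one uncalled allele is excluded from AN
group2 = [0, 0, 0, 0]      # no non-reference allele in this group
print(allele_counts(group1))
print(passes_filter([group1, group2]))
print(passes_filter([group1, group1]))
```

The default threshold of 0.001 corresponds to the 0.1% frequency cutoff quoted in the text, applied here to every group to match the "in both" wording of the revised sentence.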

tex/bioinfo.cls (+5 -5)

@@ -1,4 +1,4 @@
-\newcommand\classname{bioinfo2}
+\newcommand\classname{bioinfo}
 \newcommand\lastmodifieddate{2003/02/08}
 \newcommand\versionnumber{0.1}
 
@@ -454,7 +454,7 @@
 \vspace*{\aboveskipchk}%
 \vspace{\dropfromtop}%
 \hbox to \textwidth{%
-{\helvetica\itshape\bfseries\fontsize{19}{12}\selectfont {\color{gray}PREPRINT}
+{\helvetica\itshape\bfseries\fontsize{19}{12}\selectfont {\color{gray}BIOINFORMATICS}
 \hfil
 \if@appnotes APPLICATIONS NOTE\hfil\fi
 }%
@@ -475,9 +475,9 @@
 \vspace{4\p@}
 {\helvetica\fontsize{10}{12}\selectfont\raggedright \@address \par}%
 \vspace{4\p@}
-%{\helvetica\fontsize{8}{10}\selectfont\raggedright \@history \par}
-%\vspace{24\p@}
-%{\helvetica\fontsize{10}{12}\selectfont\raggedright \@editor \par}
+{\helvetica\fontsize{8}{10}\selectfont\raggedright \@history \par}
+\vspace{24\p@}
+{\helvetica\fontsize{10}{12}\selectfont\raggedright \@editor \par}
 %\vspace{20\p@}
 }%
 }
