@@ -27,8 +27,9 @@ \section{Summary:} BGT is a compact format, a fast command line tool and a
simple web application for efficiently and conveniently querying whole-genome
genotypes and frequencies across tens to hundreds of thousands of samples.
On real data, it compresses the genotypes of 32,488 samples across 39.2
- million sites down to a 7.4GB database and processes a couple of hundred
- million genotypes per CPU second.
+ million SNPs down to a 7.4GB database and processes a couple of hundred
+ million genotypes per CPU second. The high performance enables real-time
+ responses to complex queries.

\section{Availability and implementation:} https://github.com/lh3/bgt
@@ -40,7 +41,7 @@ \section{Introduction}
VCF/BCF~\citep{Danecek:2011qy} is the primary format for storing and analyzing
genotypes of multiple samples. It however has a few issues. Firstly, as a
streaming format, VCF compresses all types of information together. Retrieving
- site annotations or the genotypes of a few samples usually requires to decoding
+ site annotations or the genotypes of a few samples usually requires decoding
the genotypes of all samples, which is unnecessarily expensive. Secondly, VCF
does not take advantage of linkage disequilibrium (LD), while using this
information can dramatically improve compression ratio~\citep{Durbin:2014yq}.
@@ -49,8 +50,11 @@ \section{Introduction}
ambiguity complicates annotations, query of alleles and integration of multiple
data sets. Lastly, most existing VCF-based tool chains do not support
expressive data query. We frequently need to write scripts for advanced
- queries, which costs both development and processing time. BGT is designed to
- overcome or alleviate these issues.
+ queries, which costs both development and processing time.
+ GQT~\citep{Ruan:2015ab} attempts to solve some of these issues. While it is
+ very fast for selecting a subset of samples and for traversing all sites, it
+ discards phasing, is inefficient for region queries and does not compress well.
+ These observations motivated us to develop BGT.

\begin{methods}
\section{Methods}
@@ -110,27 +114,28 @@ \subsubsection{PBWT overview}
$S_{k-1}$, computing $\phi_k\to S_k\to A_k$ derives $A_k$ from $B_k$.

When there are strong correlations between adjacent rows, which is true for
- haplotype data due to linkage disequillibrium, $0$s and $1$s tend to form long
+ haplotype data due to LD, $0$s and $1$s tend to form long
runs in $B_k$. This usually makes $B_k$ much more compressible than $A_k$ under
- run-length encoding. For our test data set, 32 thousand samples can be
- compressed to less than 200 bytes in average.
+ run-length encoding. For our test data set, 32 thousand genotypes in a row can
+ be compressed to less than 200 bytes on average.

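The compressibility claim above can be illustrated with a minimal run-length encoder. This is only an illustrative sketch of why PBWT-sorted rows compress well, not BGT's actual codec; the `rle_encode` helper is hypothetical:

```python
def rle_encode(bits):
    """Run-length encode a binary sequence as [symbol, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([b, 1])  # start a new run
    return runs

# A PBWT-sorted row with strong LD: long runs of identical alleles.
row = [0] * 40000 + [1] * 5000 + [0] * 20000
runs = rle_encode(row)
# 65,000 genotypes collapse to a handful of runs.
print(len(runs))  # 3
```

Without the PBWT sort, the same alleles scattered across the row would produce many short runs, which is why $B_k$ compresses so much better than $A_k$.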
\subsection{Query with BGT}

\subsubsection{Flat Metadata Format (FMF)}

- BGT introduces a new but simple text format, FMF, to manage meta data. FMF is
- TAB-delimited with the first column showing the row name and following columns
- giving typed key-value pairs. An example looks like:
+ BGT introduces a new but simple text format, FMF, to manage metadata. FMF has
+ a similar data model to wide-column stores. It is TAB-delimited with the first
+ column showing the row name and following columns giving typed key-value pairs.
+ An example looks like:
\begin{center}
\begin{verbatim}
sample1 gender:Z:M height:f:1.73 foo:i:10
sample2 gender:Z:F height:f:1.64 bar:i:20
\end{verbatim}
\end{center}
where rows can be retrieved by an arbitrary expression such as
- ``height$>$1.65''. It has a similar data model to wide-column databases. BGT
- uses FMF to keep and query sample phenotypes and variant annotations.
+ ``height$>$1.65''. BGT uses FMF to keep and query sample phenotypes and variant
+ annotations.

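The FMF rows above can be parsed in a few lines. A minimal sketch, assuming only what the example shows (the `parse_fmf_line` helper is hypothetical, and only the type letters Z/f/i from the example are handled; this is not BGT's implementation):

```python
def parse_fmf_line(line):
    """Parse one FMF row: TAB-delimited, row name first,
    followed by typed key:TYPE:value fields (Z=string, f=float, i=int)."""
    fields = line.rstrip("\n").split("\t")
    name, attrs = fields[0], {}
    for f in fields[1:]:
        key, typ, val = f.split(":", 2)
        if typ == "i":
            attrs[key] = int(val)
        elif typ == "f":
            attrs[key] = float(val)
        else:  # 'Z': plain string
            attrs[key] = val
    return name, attrs

name, attrs = parse_fmf_line("sample1\tgender:Z:M\theight:f:1.73\tfoo:i:10")
# A condition like ``height>1.65'' then reduces to a test on the parsed values.
print(name, attrs["height"] > 1.65)  # sample1 True
```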
\subsubsection{Query genotypes}

@@ -155,22 +160,23 @@ \subsection{BGT server}
BGT comes with a standalone web server frontend implemented in the Go
programming language. The server has a similar interface to the command line
tool, but with additional consideration of sample anonymity. With BGT,
- each sample has an attribute minimal group size or MGS. On query, the server
+ each sample has an attribute `minimal group size' or MGS. On query, the server
refuses to create a sample group if the size of this group is smaller than the
MGS of any sample in this group. In particular, if a sample has MGS larger than
one, users cannot access its sample name and individual genotypes, but can
- retrieve allele counts computed together with other samples.
+ retrieve allele counts computed together with other samples. This prevents
+ users from accessing data at the level of a single sample.

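The MGS policy above reduces to a single comparison: a requested sample group is allowed only if its size is at least the largest MGS among its members. A minimal sketch of that rule (the real server is written in Go; the `group_allowed` helper is hypothetical):

```python
def group_allowed(group_mgs):
    """Return True if a sample group may be created: the group size must be
    at least the MGS (minimal group size) of every sample in the group."""
    return len(group_mgs) >= max(group_mgs)

# A sample with MGS 3 can only contribute to groups of >= 3 samples,
# so its individual genotypes are never exposed on their own.
print(group_allowed([1, 1, 3]))  # True: size 3 >= max MGS 3
print(group_allowed([1, 3]))     # False: size 2 < MGS 3
```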
\end{methods}

\section{Results}

We generated the BGT database for the first release of the Haplotype Reference
- Consortium (HRC; http://bit.ly/HRC-org). The input is a BCF containing 32,488 samples across 39.2
- million sites on autosomes. The BGT file size is 7.4GB, 11\% of the
- genotype-only BCF. Decoding the genotypes of all
- samples across 142k sites in a 10Mbp region takes 11 CPU seconds, which amounts
- to decoding 420 million genotypes per second. This speed is even faster than
+ Consortium (HRC; http://bit.ly/HRC-org). The input is a BCF containing 32,488
+ samples across 39.2 million SNPs on autosomes. The BGT file size is 7.4GB, 11\%
+ of the genotype-only BCF, or 8\% of the GQT file. Decoding the genotypes of all samples
+ across 142k sites in a 10Mbp region takes 11 CPU seconds, which amounts to
+ decoding 420 million genotypes per second. This speed is even faster than
computing allele counts and outputting VCF.

We use the following command line to demonstrate the query syntax of BGT:
@@ -182,59 +188,57 @@ \section{Results}
HRC-r1.bgt
\end{verbatim}
\end{center}
- It finds chr11 coding variants annotated in `var.fmf.gz' that have $\ge$0.1\%
- frequency in the IBD data set (http://www.ibdresearch.co.uk) but absent from
+ It finds BRCA1 variants annotated in `var.fmf.gz' that have $\ge$0.1\%
+ frequency in both the IBD data set (http://www.ibdresearch.co.uk) and
1000 Genomes~\citep{1000-Genomes-Project-Consortium:2012aa}. In this command line, {\tt -G} disables the output of genotypes.
Option {\tt -a} selects variants with the `gene' attribute equal to `BRCA1'
according to the variant database specified with {\tt -d}. This condition
is independent of sample genotypes. Each option {\tt -s} selects a group of
- samples, again independent of sample genotypes. For the \char35-th sample
+ samples based on phenotypes. For the \char35-th sample
group/{\tt -s}, BGT counts the total number of called alleles and the number of
non-reference alleles and writes them to the {\tt AN\char35} and {\tt
AC\char35} aggregate variables, respectively. Option {\tt -f} then uses these
aggregate variables to filter output.
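The AN\char35/AC\char35 aggregates described above amount to, per site and per {\tt -s} group, counting called and non-reference alleles and then filtering on the resulting frequency. A minimal sketch of that computation (the `allele_counts` helper is hypothetical, not BGT's code; `None` stands in for a missing allele call):

```python
def allele_counts(alleles):
    """Compute AN (number of called alleles) and AC (number of non-reference
    alleles) for one sample group at one site; None marks a missing allele."""
    called = [a for a in alleles if a is not None]
    return len(called), sum(1 for a in called if a != 0)

# Three diploid samples at one site with genotypes 0/1, 1/1 and 0/., so
# AN counts 5 called alleles and AC counts 3 non-reference alleles.
an, ac = allele_counts([0, 1, 1, 1, 0, None])
freq_ok = an > 0 and ac / an >= 0.001  # e.g. an AC1/AN1 >= 0.1% style filter
print(an, ac, freq_ok)  # 5 3 True
```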
- This command line takes 12 CPU seconds with most of time spent on reading
+ The command line above takes 12 CPU seconds with most of the time spent on reading
through the variant annotation file to find matching alleles. The BGT server
reads the entire file into memory to alleviate the overhead, but a better
solution would be to set up a proper database for variant annotations.

\section{Discussions}

Given a multi-sample VCF, most BGT functionalities can be achieved with small
- scripts. However, as a command line tool, BGT is still an advance in several
- ways. Firstly, it saves development time. Extracting information from multiple
- files can be done with a command line instead of a script. Secondly,
+ scripts, but as a command line tool, BGT still has a few advantages. Firstly, it
+ saves development time. Extracting information from multiple files can be done
+ with a command line instead of a script. Secondly,
BGT saves processing time. With high-performance C code at the core, BGT is
much faster than processing VCF in a scripting language such as Perl or Python.
For example, deriving allele counts in a 10Mbp region for the HRC data takes 30
seconds with BGT, but doing the same with a Perl script takes 40 minutes, an
80-fold difference. Thirdly, the design of one non-reference allele per record
- and the separation of variant annotation from genotype data helps more scalable
- data processing. These design choices make BGT merge is much simpler and faster
- than generic VCF merge. This enables efficient query across multiple BGT
- databases which is not practical with VCF.
-
- The BGT server tries to solve a bigger problem: data sharing. A key feature of
- the server is to make the user-requested summary of genotypes available while
- keeping samples unidentifiable~\citep{Stade:2014ty}. Instead of always
+ makes BGT merge much simpler and faster than generic VCF merge. This enables
+ efficient query across multiple BGT databases, which is not practical with VCF.
+
+ The BGT server tries to solve a bigger problem: data sharing. Instead of always
delivering full data in VCF, projects could have a new option to serve data
- publicly with BGT, letting users select the summary statistics of interest
- without breaching privacy.
+ publicly with the BGT server, letting users select the summary statistics of
+ interest on the fly while keeping samples unidentifiable. This is an improvement
+ over \citet{Stade:2014ty}, which only provides precomputed summaries.

% summary data is smaller, but genotypes in PBWT is not large, either.

We acknowledge that our MGS-based data sharing policy might have oversimplified
real scenarios, but we believe this direction, with proper improvements and
more importantly the approval of ethical review boards, will be more open,
convenient, efficient and secure than our current
- share-everything-with-permission model.
+ share-everything-under-permission model.
\section*{Acknowledgement}
I am grateful to HRC for granting the permission to use the data for evaluating
- the performance of BGT and thank the Global Alliance for the helpful
- suggestions.
+ the performance of BGT and thank the Global Alliance Data Working Group for the
+ helpful suggestions.
\paragraph{Funding\textcolon} NHGRI U54HG003037; NIH GM100233

\bibliography{bgt}
+
\end{document}