Commit 1ccf6ce

ACF: a hypothetic format
1 parent c3a6e28 commit 1ccf6ce

1 file changed

+95
-0
lines changed

acf.md

@@ -0,0 +1,95 @@
# The Atomic VCF Format (ACF)

The Atomic VCF format (ACF) is a strict subset of VCF for describing genetic
sequence variations in a population. Unlike VCF, which may encode multiple
variants on one line (a.k.a. record), ACF encodes only one atomic allele per
record. The representation of a set of variants is unique in ACF. BGT always
converts VCF to ACF.

## Definitions

Given a reference genome, an *allele* is a 4-tuple (`ctg`,`pos`,`len`,`seq`),
indicating that sequence `seq` replaces a `len`-long reference subsequence at
`ctg`:`pos`. An allele is a *reference allele* (or REF) if `seq` is the same
as the reference subsequence; otherwise the allele is an *alternate allele* (or
ALT). An ALT is *atomic* if it is a single substitution, a single insertion or
a single deletion.

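As a rough illustration of these definitions, the Python sketch below
classifies a 4-tuple against a reference string. It is not part of BGT: the
`Allele` type, the `classify` function and the field name `length` (standing in
for `len`) are hypothetical. It assumes VCF-style anchoring, where an insertion
or deletion keeps the first reference base, and it treats a substitution as a
single-base change.

```python
from typing import NamedTuple


class Allele(NamedTuple):
    ctg: str     # contig name
    pos: int     # 1-based start of the replaced reference subsequence
    length: int  # `len` in the text: length of the replaced subsequence
    seq: str     # sequence that replaces it


def classify(allele, ref_seq):
    """Classify `allele` against `ref_seq`, the full sequence of `allele.ctg`."""
    start = allele.pos - 1
    ref_sub = ref_seq[start:start + allele.length]
    seq = allele.seq
    if seq == ref_sub:
        return "REF"
    if len(ref_sub) == 1 and len(seq) == 1:
        return "substitution"
    # Assumption: indels are anchored on the first reference base, as in VCF.
    if len(ref_sub) == 1 and len(seq) > 1 and seq[0] == ref_sub:
        return "insertion"
    if len(seq) == 1 and len(ref_sub) > 1 and seq == ref_sub[0]:
        return "deletion"
    return "non-atomic ALT"


# Reference from the simple example below, with N standing in for the X padding.
ref = "N" * 9 + "CATATGCAAGTCGTTATTAGAGCT"
print(classify(Allele("chr1", 16, 3, "C"), ref))  # deletion
```
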
In ACF, each record encodes one atomic ALT. For a particular allele *X*, if
other alleles overlap with *X*, they are represented by a symbolic allele
`<M>`. `<M>` plays the same role as the `*` allele in VCF for deletions, but it
is more general. Because `<M>` stands for all types of other overlapping
alleles, the allele number in the GT field can only be `.`, `0`, `1` or `2` in
ACF. In BGT, these four allele numbers are encoded with 2 bits.

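To make the 2-bit claim concrete, here is a small Python sketch that packs
allele numbers four to a byte. The text does not say which bit pattern BGT
assigns to each value, nor how it lays them out, so the `CODE` table and the
`pack`/`unpack` helpers are assumptions made purely for illustration.

```python
# Hypothetical code assignment: the text only says four values fit in 2 bits.
CODE = {"0": 0, "1": 1, "2": 2, ".": 3}
SYMBOL = {code: symbol for symbol, code in CODE.items()}


def pack(alleles):
    """Pack a list of allele numbers ('.', '0', '1', '2') into bytes, 4 per byte."""
    out = bytearray((len(alleles) + 3) // 4)
    for i, allele in enumerate(alleles):
        out[i >> 2] |= CODE[allele] << ((i & 3) * 2)
    return bytes(out)


def unpack(packed, n):
    """Recover the first `n` allele numbers from `packed`."""
    return [SYMBOL[(packed[i >> 2] >> ((i & 3) * 2)) & 3] for i in range(n)]


# Eight haplotypes (four diploid samples) fit in two bytes.
alleles = ["0", "2", "2", "0", "0", "0", "0", "1"]
assert unpack(pack(alleles), len(alleles)) == alleles
```
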
## Examples

### A simple example

```txt
Pos:  12345678901234567890123 4567890123
Ref:  XXXXXXXXXCATATGCAAGTCGT-TATTAGAGCTXXXXX
H1:   XXXXXXXXXCATGTGC--GTCGTATATT----CTXXXXX
H2:   XXXXXXXXXCATATGCAAGTCGTATATT--AGCTXXXXX
```
The corresponding ACF is
```txt
chr1 13 A G .. 1|0
chr1 16 CAA C .. 1|0
chr1 23 T TA .. 1|1
chr1 27 TAGAG T,<M> .. 1|2
chr1 27 TAG T,<M> .. 2|1
```
In the first three records, there are no overlapping variants, so ACF is
identical to VCF in this case. There are two overlapping deletions at pos 27.
In ACF, we describe each separately and use `<M>` as a placeholder for alleles
not described on the line. When we parse ACF at pos 27, we have to read both
records to reconstruct the underlying alignment.

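One way to picture this reconstruction is sketched below in Python. It is only
an illustration, not BGT code. The `reconstruct` function and its record tuples
are hypothetical. It assumes that every allele a haplotype carries appears as
allele `1` in exactly one record, so allele `2` (`<M>`) can be skipped because
another record spells that allele out. It also assumes phased genotypes.

```python
def reconstruct(ref, ref_start, records, hap):
    """Rebuild one haplotype sequence from ACF-like records.

    ref       -- reference string whose first base is at 1-based position ref_start
    records   -- list of (pos, ref_allele, alt_allele, gt), gt being e.g. "1|2"
    hap       -- 0 for the first haplotype of the sample, 1 for the second
    """
    # Keep only the atomic ALTs this haplotype carries (allele number 1).
    edits = [(pos, r, a) for pos, r, a, gt in records if gt.split("|")[hap] == "1"]
    seq = ref
    # Apply edits right to left so earlier coordinates stay valid.
    for pos, r, a in sorted(edits, reverse=True):
        i = pos - ref_start
        assert seq[i:i + len(r)] == r, "REF allele does not match the reference"
        seq = seq[:i] + a + seq[i + len(r):]
    return seq


# The simple example, restricted to positions 10..33 (the X padding is dropped).
ref = "CATATGCAAGTCGTTATTAGAGCT"
records = [
    (13, "A", "G", "1|0"),
    (16, "CAA", "C", "1|0"),
    (23, "T", "TA", "1|1"),
    (27, "TAGAG", "T", "1|2"),
    (27, "TAG", "T", "2|1"),
]
print(reconstruct(ref, 10, records, 0))  # CATGTGCGTCGTATATTCT
print(reconstruct(ref, 10, records, 1))  # CATATGCAAGTCGTATATTAGCT
```
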
### A contrived example

```txt
Pos:  123456789012345678 90
Ref:  XXXXXXXXXGTATATAGC-GAXXXXX
H1:   XXXXXXXXXGTATA-------XXXXX
H2:   XXXXXXXXXG------GCTGAXXXXX
```
This alignment can be encoded in ACF as
```txt
chr1 10 GTATATA G,<M> .. 2|1
chr1 14 ATAGCGA A,<M> .. 1|2
chr1 18 C CT,<M> .. 2|1
```
Similarly, an ACF parser has to keep all three records in memory to correctly
reconstruct the haplotypes.

### A multi-sample example

The following VCF
```txt
11 101 GCGT G,GCGA,GTGA,CCGT .. 0|1 1|2 2|3 2|4
```
can be converted with `bgt atomize -MS` to ACF
```txt
11 101 G C,<M> .. 0|2 2|0 0|0 0|1
11 101 GCGT G,<M> .. 0|1 1|2 2|2 2|2
11 102 C T,<M> .. 0|2 2|0 0|1 0|0
11 104 T A,<M> .. 0|2 2|1 1|1 1|0
```

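The Python sketch below shows one set of rules that reproduces the conversion
above. It is not the `bgt atomize` implementation, which may differ. The
functions `atoms_of`, `overlaps` and `atomize` are hypothetical, and only
same-length substitutions and anchored single indels are handled. The assumed
rules are: a haplotype gets `1` if its original allele contains the record's
atomic ALT, `2` if it contains another atomic ALT whose REF span overlaps the
record's, and `0` otherwise.

```python
def atoms_of(pos, ref, alt):
    """Split one VCF ALT into atomic alleles (pos, ref, alt). Simplified."""
    if len(ref) == len(alt):  # decompose base by base into substitutions
        return [(pos + i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    if len(alt) == 1 and ref[0] == alt:  # single anchored deletion
        return [(pos, ref, alt)]
    if len(ref) == 1 and alt[0] == ref:  # single anchored insertion
        return [(pos, ref, alt)]
    raise ValueError("complex allele not handled by this sketch")


def overlaps(a, b):
    """True if the REF spans of two atomic alleles intersect."""
    return a[0] < b[0] + len(b[1]) and b[0] < a[0] + len(a[1])


def atomize(pos, ref, alts, genotypes):
    """Convert one phased multi-allelic record into ACF-like records."""
    per_allele = [[]] + [atoms_of(pos, ref, alt) for alt in alts]  # index 0 is REF
    atoms = sorted({atom for lst in per_allele for atom in lst},
                   key=lambda t: (t[0], len(t[1]), t[2]))
    out = []
    for rec in atoms:
        gts = []
        for gt in genotypes:
            codes = []
            for idx in gt.split("|"):
                carried = [] if idx == "." else per_allele[int(idx)]
                if idx == ".":
                    codes.append(".")
                elif rec in carried:
                    codes.append("1")
                elif any(overlaps(rec, other) for other in carried):
                    codes.append("2")
                else:
                    codes.append("0")
            gts.append("|".join(codes))
        uses_m = any("2" in g for g in gts)  # add <M> only when it is referenced
        out.append((rec[0], rec[1], rec[2] + (",<M>" if uses_m else ""), gts))
    return out


for rec in atomize(101, "GCGT", ["G", "GCGA", "GTGA", "CCGT"],
                   ["0|1", "1|2", "2|3", "2|4"]):
    print(11, rec[0], rec[1], rec[2], "..", *rec[3])
```
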
## An Extension to ACF

A major issue with ACF is its inconvenience: when there are overlapping
variants, we don't know what `<M>` or allele number `2` refers to. We have to
combine multiple records to reconstruct genotypes. A potential workaround is to
introduce VCF INFO and FORMAT key-value pairs to spell out `<M>`. For the 2nd
example, we can write
```txt
chr1 10 GTATATA G,<M> A2=chr1_14_7_A GT:A2 2|1:3|1
chr1 14 ATAGCGA A,<M> A2=chr1_10_7_G,chr1_18_1_CT GT:A2 1|2:1|3,4
chr1 18 C CT,<M> A2=chr1_14_7_A GT:A2 2|1:3|1
```
Here the `A2` INFO tag uses a succinct way to encode the list of overlapping
alleles; the `A2` FORMAT tag indexes into the combined ALT and A2 array and
gives the complete genotype.
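
The indexing can be sketched in Python as follows. The layout of the combined
array (REF at index 0, then the ALT entries, then the INFO `A2` entries) and
the reading of `3,4` as "this haplotype carries both alleles" are inferred from
the example above, not from a specification. The `resolve_a2` function is
hypothetical.

```python
def resolve_a2(ctg, pos, ref, alt, info_a2, format_a2):
    """List, per haplotype, the alleles named by the FORMAT A2 value."""

    def name(seq):
        # Same ctg_pos_len_seq spelling as the INFO A2 tag; <M> stays symbolic.
        return seq if seq.startswith("<") else f"{ctg}_{pos}_{len(ref)}_{seq}"

    # Assumed layout: index 0 is REF, then ALT entries, then INFO A2 entries.
    combined = [name(ref)] + [name(a) for a in alt.split(",")] + info_a2.split(",")
    return [[combined[int(i)] for i in hap.split(",")]
            for hap in format_a2.split("|")]


# Second record of the example: the second haplotype carries two other alleles.
print(resolve_a2("chr1", 14, "ATAGCGA", "A,<M>",
                 "chr1_10_7_G,chr1_18_1_CT", "1|3,4"))
# [['chr1_14_7_A'], ['chr1_10_7_G', 'chr1_18_1_CT']]
```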
