| 1 | +# The Atomic VCF Format (ACF) |
| 2 | + |
| 3 | +The Atomic VCF format (ACF) is a strict subset of VCF for describing genetic |
| 4 | +sequence variations in a population. Unlike VCF, which may encode multiple |
| 5 | +variants on one line (a.k.a. record), ACF encodes only one atomic allele per |
| 6 | +record. The representation for a set of variants is unique in ACF. BGT always |
| 7 | +converts VCF to ACF. |
| 8 | + |
| 9 | +## Definitions |
| 10 | + |
| 11 | +Given a reference genome, an *allele* is a 4-tuple (`ctg`,`pos`,`len`,`seq`), |
| 12 | +indicating sequence `seq` replaces a `len`-long reference subsequence at |
| 13 | +`ctg`:`pos`. An allele is a *reference allele* (or REF) if `seq` is the same |
| 14 | +as the reference subsequence; otherwise the allele is an *alternate allele* (or |
| 15 | +ALT). An ALT is *atomic* if it is a single substitution, a single insertion or |
| 16 | +a single deletion. |
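
The definitions above can be sketched in a few lines of Python. This is only an illustration, not BGT code: the `Allele` class and `classify` function are hypothetical names, the field `length` stands for `len` in the 4-tuple, and the sketch assumes that "single substitution" means a single-base change and that insertions and deletions keep the leading reference base, as in the VCF-style records used in the examples of this document.

```python
from typing import NamedTuple


class Allele(NamedTuple):
    """The 4-tuple (ctg, pos, len, seq); `length` stands for `len`."""
    ctg: str
    pos: int      # 1-based position on the contig
    length: int   # number of reference bases replaced
    seq: str      # sequence that replaces them


def classify(allele: Allele, ref_sub: str) -> str:
    """Classify an allele given the reference subsequence it replaces.

    Assumption: indels are anchored on the first reference base.
    """
    if len(ref_sub) != allele.length:
        raise ValueError("ref_sub must be exactly `length` bases long")
    if allele.seq == ref_sub:
        return "REF"
    if allele.length == 1 and len(allele.seq) == 1:
        return "substitution"
    if allele.length == 1 and len(allele.seq) > 1 and allele.seq[0] == ref_sub:
        return "insertion"
    if allele.length > 1 and allele.seq == ref_sub[0]:
        return "deletion"
    return "non-atomic"
```

Under these assumptions, `classify(Allele("chr1", 16, 3, "C"), "CAA")` reports a deletion, while an allele that changes two bases at once, such as `GCGT` to `GTGA`, is reported as non-atomic.
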
| 17 | + |
| 18 | +In ACF, each record encodes one atomic ALT. For a particular allele *X*, if |
| 19 | +there are other allele(s) overlapping with *X*, these other allele(s) will be |
| 20 | +indicated by a symbolic allele `<M>`. `<M>` plays the same role as the `*` |
| 21 | +allele in VCF for deletions, but it is more general. Because we use `<M>` for |
| 22 | +all types of other overlapping alleles, in ACF, the allele number in the GT |
| 23 | +field can only be `.`, `0`, `1` or `2`. In BGT, these four allele numbers are |
| 24 | +encoded with 2 bits. |
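
Because only four allele numbers exist, four of them fit in one byte. The sketch below shows one way to pack them. The mapping of `.` to the value 3 and the low-bits-first layout are assumptions made for illustration; this document does not describe the bit layout BGT actually uses, and `pack` and `unpack` are hypothetical helpers.

```python
# Hypothetical mapping: the text only says that 2 bits are used.
CODES = {"0": 0, "1": 1, "2": 2, ".": 3}
SYMBOLS = {value: key for key, value in CODES.items()}


def pack(alleles):
    """Pack allele numbers ('.', '0', '1', '2') four to a byte."""
    out = bytearray((len(alleles) + 3) // 4)
    for i, allele in enumerate(alleles):
        out[i // 4] |= CODES[allele] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n):
    """Recover the first n allele numbers from packed bytes."""
    return [SYMBOLS[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n)]
```

With this layout, the six allele numbers of three phased diploid genotypes occupy two bytes.
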
| 25 | + |
| 26 | +## Examples |
| 27 | + |
| 28 | +### A simple example |
| 29 | + |
| 30 | +```txt |
| 31 | +Pos: 12345678901234567890123 4567890123 |
| 32 | +Ref: XXXXXXXXXCATATGCAAGTCGT-TATTAGAGCTXXXXX |
| 33 | +H1: XXXXXXXXXCATGTGC--GTCGTATATT----CTXXXXX |
| 34 | +H2: XXXXXXXXXCATATGCAAGTCGTATATT--AGCTXXXXX |
| 35 | +``` |
| 36 | +The corresponding ACF is |
| 37 | +```txt |
| 38 | +chr1 13 A G .. 1|0 |
| 39 | +chr1 16 CAA C .. 1|0 |
| 40 | +chr1 23 T TA .. 1|1 |
| 41 | +chr1 27 TAGAG T,<M> .. 1|2 |
| 42 | +chr1 27 TAG T,<M> .. 2|1 |
| 43 | +``` |
| 44 | +On the first three records, there are no overlapping variants. ACF is identical |
| 45 | +to VCF in this case. There are two overlapping deletions at pos 27. In ACF, we |
| 46 | +describe each separately and use `<M>` as a placeholder for alleles not |
| 47 | +described on the line. When we parse ACF at pos 27, we have to read both |
| 48 | +records to reconstruct the underlying alignment. |
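
As a rough illustration of that reconstruction, the sketch below replays the five records of the simple example onto the reference. It is not BGT's parser: `reconstruct` is a hypothetical helper, the reference string is the `Ref` row with the alignment gap removed (the `X` placeholders are kept as-is), and the sketch assumes phased genotypes in which the alleles numbered `1` on one haplotype never overlap each other.

```python
# Reference from the simple example, alignment gap removed (1-based coordinates).
REF = "XXXXXXXXX" + "CATATGCAAGTCGT" + "TATTAGAGCT" + "XXXXX"

# (pos, REF allele, atomic ALT, phased GT) for the five ACF records.
RECORDS = [
    (13, "A", "G", "1|0"),
    (16, "CAA", "C", "1|0"),
    (23, "T", "TA", "1|1"),
    (27, "TAGAG", "T", "1|2"),
    (27, "TAG", "T", "2|1"),
]


def reconstruct(ref, records, hap):
    """Rebuild haplotype `hap` (0 or 1) by applying every record whose
    allele number on that haplotype is 1; 0, 2 and '.' are skipped."""
    seq = ref
    # Apply edits right to left so that earlier coordinates stay valid.
    for pos, ref_allele, alt, gt in sorted(records, key=lambda r: r[0], reverse=True):
        if gt.split("|")[hap] != "1":
            continue
        start = pos - 1
        if seq[start:start + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref_allele):]
    return seq
```

Calling `reconstruct(REF, RECORDS, 0)` and `reconstruct(REF, RECORDS, 1)` gives the `H1` and `H2` rows of the alignment with the gap characters removed. Allele number `2` is skipped on purpose: the allele it stands for is applied when its own record is processed.
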
| 49 | + |
| 50 | +### A contrived example |
| 51 | + |
| 52 | +```txt |
| 53 | +Pos: 123456789012345678 90 |
| 54 | +Ref: XXXXXXXXXGTATATAGC-GAXXXXX |
| 55 | +H1: XXXXXXXXXGTATA-------XXXXX |
| 56 | +H2: XXXXXXXXXG------GCTGAXXXXX |
| 57 | +``` |
| 58 | +This can be encoded in ACF as |
| 59 | +```txt |
| 60 | +chr1 10 GTATATA G,<M> .. 2|1 |
| 61 | +chr1 14 ATAGCGA A,<M> .. 1|2 |
| 62 | +chr1 18 C CT,<M> .. 2|1 |
| 63 | +``` |
| 64 | +Similarly, an ACF parser has to keep all three records in memory to correctly |
| 65 | +reconstruct the haplotypes. |
| 66 | + |
| 67 | +### A multi-sample example |
| 68 | + |
| 69 | +The following VCF |
| 70 | +```txt |
| 71 | +11 101 GCGT G,GCGA,GTGA,CCGT .. 0|1 1|2 2|3 2|4 |
| 72 | +``` |
| 73 | +can be converted with `bgt atomize -MS` to ACF |
| 74 | +```txt |
| 75 | +11 101 G C,<M> .. 0|2 2|0 0|0 0|1 |
| 76 | +11 101 GCGT G,<M> .. 0|1 1|2 2|2 2|2 |
| 77 | +11 102 C T,<M> .. 0|2 2|0 0|1 0|0 |
| 78 | +11 104 T A,<M> .. 0|2 2|1 1|1 1|0 |
| 79 | +``` |
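
The conversion above can be reproduced with a simplified rule. The sketch below is not the algorithm used by `bgt atomize`: `decompose`, `overlaps` and `atomize` are hypothetical helpers, only same-length ALTs and anchored deletions are handled, and two alleles are assumed to overlap when their REF spans share at least one position.

```python
def decompose(pos, ref, alt):
    """Split one VCF ALT into atomic (pos, ref, alt) alleles.
    Only two easy cases are handled in this sketch."""
    if len(alt) == len(ref):
        # Same length: every mismatching base becomes one substitution.
        return [(pos + i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    if len(alt) == 1 and alt == ref[0]:
        # Anchored deletion: already atomic.
        return [(pos, ref, alt)]
    raise NotImplementedError("the general case needs an alignment")


def overlaps(a, b):
    """True if the REF spans of two atomic alleles share a position."""
    return a[0] < b[0] + len(b[1]) and b[0] < a[0] + len(a[1])


def atomize(pos, ref, alts, genotypes):
    """Turn one multi-allelic record into rows (pos, REF, ALT, genotypes)."""
    per_alt = [[]] + [decompose(pos, ref, alt) for alt in alts]  # index 0 is REF
    atoms = sorted({atom for atom_list in per_alt for atom in atom_list})

    def code(atom, k):
        if k == ".":
            return "."
        carried = per_alt[int(k)]
        if atom in carried:
            return "1"
        if any(overlaps(atom, other) for other in carried):
            return "2"  # some other overlapping allele, i.e. <M>
        return "0"

    rows = []
    for atom in atoms:
        gts = ["|".join(code(atom, k) for k in gt.split("|")) for gt in genotypes]
        alt_col = atom[2] + ",<M>" if any("2" in gt for gt in gts) else atom[2]
        rows.append((atom[0], atom[1], alt_col, gts))
    return rows
```

Calling `atomize(101, "GCGT", ["G", "GCGA", "GTGA", "CCGT"], ["0|1", "1|2", "2|3", "2|4"])` yields the four ACF rows listed above, in the same order.
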
| 80 | + |
| 81 | +## An Extension to ACF |
| 82 | + |
| 83 | +A major issue with ACF is its inconvenience: when there are overlapping |
| 84 | +variants, we don't know what `<M>` or allele number `2` refers to. We have to |
| 85 | +combine multiple records to reconstruct genotypes. A potential workaround is to |
| 86 | +introduce VCF INFO and FORMAT key-value pairs to spell out `<M>`. For the second |
| 87 | +example, we can write |
| 88 | +```txt |
| 89 | +chr1 10 GTATATA G,<M> A2=chr1_14_7_A GT:A2 2|1:3|1 |
| 90 | +chr1 14 ATAGCGA A,<M> A2=chr1_10_7_G,chr1_18_1_CT GT:A2 1|2:1|3,4 |
| 91 | +chr1 18 C CT,<M> A2=chr1_14_7_A GT:A2 2|1:3|1 |
| 92 | +``` |
| 93 | +Here the `A2` INFO tag uses a succinct way to encode the list of overlapping |
| 94 | +alleles; the `A2` FORMAT tag indexes into the combined ALT and A2 array and |
| 95 | +gives the complete genotype. |
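
To make the indexing concrete, the sketch below resolves the `A2` FORMAT value of one abbreviated record into explicit alleles. It is an illustration only: `resolve_a2` is a hypothetical helper, the input is the shortened single-sample row layout used in this document rather than a full VCF line, and the numbering (0 for REF, then the ALT column, then the `A2` INFO entries) is inferred from the example, where index 3 selects the first `A2` entry.

```python
def resolve_a2(line):
    """Return, for each haplotype, the alleles named by the A2 FORMAT field.

    Expects an abbreviated row: CHROM POS REF ALT INFO FORMAT SAMPLE.
    """
    ctg, pos, ref, alt, info, fmt, sample = line.split()
    a2_info = info.split("=", 1)[1].split(",")

    def name(seq):
        # Spell an allele of this record as ctg_pos_len_seq.
        return f"{ctg}_{pos}_{len(ref)}_{seq}"

    # Combined array: index 0 is REF, then the ALT column, then A2 entries.
    combined = [name(ref)]
    combined += [a if a.startswith("<") else name(a) for a in alt.split(",")]
    combined += a2_info

    values = dict(zip(fmt.split(":"), sample.split(":")))
    return [
        [combined[int(i)] for i in hap.split(",")]
        for hap in values["A2"].split("|")
    ]
```

For the second record above, the first haplotype resolves to `chr1_14_7_A` and the second to both `chr1_10_7_G` and `chr1_18_1_CT`, without consulting the other two records.
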