Tutorial Curation Common Cases
Back to the Main Curation Tutorial Page
This is a sort of bestiary of common curation cases and how to resolve them.
You'll want to keep the GenomeDiff File Format specification handy for understanding how we manually curate mutation entries.
IS elements (simple bacterial transposons) are often drivers of microbial evolution. They can insert into genes and disrupt their function. Because multiple copies of IS elements in a genome also provide homologous sequences for recombination, they also catalyze deletions. In longer evolution experiments, multi-step events can occur in which a new IS copy inserts at a location and recombination then occurs between that copy and another new IS element insertion nearby, resulting in a deletion that leaves behind one IS copy.
Note
You will only get clean predictions of IS element endpoints if they are annotated in your reference genome. It is highly recommended that you do this! For more information, see Reference Sequence File Formats.
For these events, you will typically see three unassigned evidence items: two JC items that connect unique positions in the genome to opposite sides of the same IS element, and one MC item indicating that the region between the corresponding reference positions where the IS elements inserted has been deleted in the sample.
How do we annotate these events in the GenomeDiff file so that we can generate the resulting genome sequence AND properly count the number of IS element insertion events that have occurred in the lineage leading to the sequenced genome?
Three different cases are possible.
This situation looks like this:
The parentheses around Intermediate 2 indicate that this may be a short-lived intermediate. That is, the second IS element insertion and the deletion may happen at the same time. For example, the new IS element may insert and, during repair of the insertion intermediate, homologous recombination may choose the other copy as a template, resulting in loss of the region between them and of one IS copy in the repaired genome.
Here's a concrete example of how we annotated one of these events that occurred in population A–5 from the E. coli LTEE.
First, we sequenced an earlier genome (A–5 50000-generation Clone REL11340) and see this IS element insertion predicted by breseq:
MOB Mutation (A–5 50000-generation Clone REL11340)
Second, we sequenced a later genome (A–5 75000-generation Clone B). In the breseq results, we don't see the IS element insertion anymore. Instead, we see the MC and matching JC unassigned evidence items indicative of an IS element insertion followed by an IS-mediated deletion.
Unassigned MC Evidence (A–5 75000-generation Clone B)
Unassigned JC Evidence (A–5 75000-generation Clone B)
Notice how the JC items connect the nucleotides before and after the boundaries of the MC to opposite sides of the IS1 element.
We would annotate this event with two lines in the GenomeDiff file:
MOB 1000 . REL606 305569 IS1 -1 9
DEL 1001 . REL606 305569 3817 mediated=IS1 within=1000:2
There's an advanced GenomeDiff tag being used here: within. The syntax for this is within=mutation_id:copy. It is separated from the rest of the line by a tab.
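To make the tag syntax concrete, here is a minimal Python sketch that splits one tab-delimited mutation line into positional fields and key=value tags. The helper names `parse_gd_mutation` and `parse_within` are hypothetical and are not part of breseq or gdtools; the sketch only assumes that columns are tab-separated and that tags have the form key=value.

```python
def parse_gd_mutation(line):
    """Split one tab-delimited GenomeDiff mutation line into fields and tags."""
    columns = line.rstrip("\n").split("\t")
    entry = {
        "type": columns[0],        # e.g. MOB or DEL
        "id": columns[1],          # mutation id, e.g. 1000
        "parent_ids": columns[2],  # "." in the hand-curated lines shown here
        "positional": [],          # type-specific fields, kept as strings
        "tags": {},                # optional key=value tags
    }
    for column in columns[3:]:
        if "=" in column:
            key, value = column.split("=", 1)
            entry["tags"][key] = value
        else:
            entry["positional"].append(column)
    return entry


def parse_within(value):
    """Interpret a within value of the form mutation_id:copy."""
    mutation_id, _, copy = value.partition(":")
    return mutation_id, (int(copy) if copy else None)


line = "DEL\t1001\t.\tREL606\t305569\t3817\tmediated=IS1\twithin=1000:2"
deletion = parse_gd_mutation(line)
print(deletion["positional"])                    # ['REL606', '305569', '3817']
print(parse_within(deletion["tags"]["within"]))  # ('1000', 2)
```

Here the DEL is tied to the second copy of the coordinates created by mutation 1000.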
The within information is necessary because gdtools APPLY will first add the IS1 element insertion with a nine-base pair duplication. After this, there are two different places that have the original coordinates 305569-305578 in the genome that is being constructed, one before and one after the new IS1 element insertion. We need the deletion to begin within the second of these coordinates so it is after the IS element and removes the newly duplicated bases.
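The effect of choosing a copy can be illustrated with a toy model. The Python sketch below is not how gdtools APPLY is implemented; it represents a genome as a list of original coordinates, and it assumes that an IS insertion places the element after the target site and then repeats the target site. The functions `apply_mob` and `apply_del` and the small coordinates are made up for illustration.

```python
def apply_mob(genome, position, dup_size, element="IS"):
    """Insert an element after the target site, then repeat the target site."""
    start = genome.index(position)  # first base of the target site
    target_site = genome[start:start + dup_size]
    return (genome[:start + dup_size] + [element] + target_site
            + genome[start + dup_size:])


def apply_del(genome, position, size, copy=1):
    """Delete `size` tokens starting at the chosen copy of a coordinate."""
    starts = [i for i, token in enumerate(genome) if token == position]
    start = starts[copy - 1]
    return genome[:start] + genome[start + size:]


reference = list(range(1, 21))  # toy genome with coordinates 1..20
with_is = apply_mob(reference, position=5, dup_size=3)

# Starting in the second copy removes the duplicated bases and keeps the IS.
second_copy = apply_del(with_is, position=5, size=8, copy=2)
# Starting in the first copy would remove the IS element itself.
first_copy = apply_del(with_is, position=5, size=8, copy=1)

print(second_copy)
print(first_copy)
```

In this toy model, only the deletion that starts in the second copy leaves one IS element joined to the sequence on the far side of the deletion, which is what the JC evidence shows.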
Alternatively, you can annotate the mutations in this way:
MOB 1000 . REL606 305569 IS1 -1 9
MOB 1001 . REL606 309386 IS1 -1 9 before=1002
DEL 1002 . REL606 305569 3826 between=IS1 within=1000:2 apply_size_adjust=-9
This latter method should be used if you have sequenced other genomes from the same population and found some that have both IS element insertions but not the deletion between them (ones that look like Intermediate 2).
There's a lot going on here, which is why you'd normally do the simpler method above if possible.
First, we added the before=1002 tag to make sure the second MOB occurs before the deletion, because we need to remove the IS element and its left-side target site duplication with the deletion. (These tags have the syntax before=mutation_id.) Second, the size of the DEL is increased by nine base pairs (3817 + 9 = 3826). This allows the DEL to include the size of the second MOB when it is being applied (because its start is before where that inserts and its end is after where it inserts). Third, we use a special apply_size_adjust=-9 tag to decrease the size of the deletion by the size of the target site duplication of the second MOB. (These tags have the syntax apply_size_adjust=offset.) This makes it so we don't delete the second copy of the target site duplication, because those bases (309386-309396) are preserved. We can't do this by decreasing the size of the DEL because then it would not completely contain and delete the second MOB.
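The arithmetic in this paragraph can be written out as a short Python sketch. The function `two_insertion_annotation` is hypothetical, and its formulas are inferred from this single example: that the second MOB sits at the left position plus the size of the simple deletion (305569 + 3817 = 309386), and that both insertions share the same target site duplication size. Treat these as assumptions rather than general rules.

```python
def two_insertion_annotation(left_pos, simple_del_size, tsd_size,
                             element="IS1", seq_id="REL606",
                             strand=-1, first_id=1000):
    """Build the three-line MOB/MOB/DEL annotation as tab-delimited strings."""
    mob1_id, mob2_id, del_id = first_id, first_id + 1, first_id + 2
    right_pos = left_pos + simple_del_size  # assumption: holds in this example
    del_size = simple_del_size + tsd_size   # 3817 + 9 = 3826

    def row(*fields):
        return "\t".join(str(field) for field in fields)

    return [
        row("MOB", mob1_id, ".", seq_id, left_pos, element, strand, tsd_size),
        row("MOB", mob2_id, ".", seq_id, right_pos, element, strand, tsd_size,
            f"before={del_id}"),
        row("DEL", del_id, ".", seq_id, left_pos, del_size,
            f"between={element}", f"within={mob1_id}:2",
            f"apply_size_adjust=-{tsd_size}"),
    ]


lines = two_insertion_annotation(305569, 3817, 9)
for line in lines:
    print(line)
```

Run with the values from the example, this reproduces the three lines shown above.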
Note
While both of these will result in the same final genome sequence if
you use them with gdtools APPLY, there is a subtle difference in how breseq
will analyze the resulting files with gdtools COUNT. In the first case, with
two lines, it will count two mutations, because it assumes that the second IS
element insertion and the deletion happened as one event. In the second case, it
will count three mutations. In both cases, it will count that there were two
new IS element insertions involved in those mutations. A further assumption of the
second method is that we know the duplication size of the second IS1 insertion.
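The counts described in this note can be reproduced with a toy tally. This Python sketch is not the logic of gdtools COUNT. It applies a guessed rule (each line is one mutation; each MOB line and each line tagged mediated= contributes one new IS insertion) that merely agrees with the numbers stated above. The `tally` function is hypothetical.

```python
SIMPLE = [
    "MOB\t1000\t.\tREL606\t305569\tIS1\t-1\t9",
    "DEL\t1001\t.\tREL606\t305569\t3817\tmediated=IS1\twithin=1000:2",
]
DETAILED = [
    "MOB\t1000\t.\tREL606\t305569\tIS1\t-1\t9",
    "MOB\t1001\t.\tREL606\t309386\tIS1\t-1\t9\tbefore=1002",
    "DEL\t1002\t.\tREL606\t305569\t3826\tbetween=IS1\twithin=1000:2"
    "\tapply_size_adjust=-9",
]


def tally(lines):
    """Return (mutations, new IS insertions) under the guessed rule."""
    mutations = len(lines)
    is_insertions = 0
    for line in lines:
        columns = line.split("\t")
        is_mob = columns[0] == "MOB"
        is_mediated = any(c.startswith("mediated=") for c in columns)
        if is_mob or is_mediated:
            is_insertions += 1
    return mutations, is_insertions


print(tally(SIMPLE))    # (2, 2)
print(tally(DETAILED))  # (3, 2)
```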
What if the IS element insertion on the right had occurred first?
The situation would look like this:
We would annotate this event with two lines in the GenomeDiff file.
MOB 1000 . REL606 309386 IS1 -1 9
DEL 1001 . REL606 305578 3817 mediated=IS1 within=1000:1
Here the DEL overlaps the first copy of the IS target site duplication.
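A quick coordinate check shows the overlap. This Python sketch assumes that the duplicated target site consists of the bases starting at the MOB position (here 309386 through 309394) and that a DEL removes `size` bases starting at its position. The helper `inclusive_range` is hypothetical.

```python
def inclusive_range(position, size):
    """Return the first and last coordinate covered by `size` bases."""
    return position, position + size - 1


mob_position, tsd_size = 309386, 9
del_position, del_size = 305578, 3817

tsd_first, tsd_last = inclusive_range(mob_position, tsd_size)
del_first, del_last = inclusive_range(del_position, del_size)

# Number of target site bases that fall inside the deleted interval.
overlap = max(0, min(tsd_last, del_last) - max(tsd_first, del_first) + 1)

print(del_first, del_last)  # 305578 309394
print(tsd_first, tsd_last)  # 309386 309394
print(overlap)              # 9
```

Under these assumptions the deletion ends exactly at the last base of the target site, so it covers all nine bases of one copy.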
Sometimes you will only have the sequence of the final sample. In that case, you can't unambiguously annotate the details of the first intermediate. We recommend annotating the left IS element first with a target site duplication of zero base pairs (essentially indicating that it is unknown), then annotating the deletion to the right of it.
For the example above, you would use this:
MOB 1000 . REL606 305569 IS1 -1 0
DEL 1001 . REL606 305570 3808 mediated=IS1
Notice that we don't need the within tag because we moved the position where the deletion
begins to right after the IS element and it has zero for the target site duplication. We also
decreased the size of the deletion by 9 bases relative to the above scenarios because
we do not need to delete the nine duplicated target site bases (they are never added).
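The adjustments relative to the earlier scenarios can be spelled out as a small Python sketch. The function `final_sample_annotation` is hypothetical, and its two formulas (start one base after the MOB position, shrink the deletion by the target site duplication size) are taken from this one example only.

```python
def final_sample_annotation(left_pos, earlier_del_size, tsd_size,
                            element="IS1", seq_id="REL606",
                            strand=-1, first_id=1000):
    """Build the MOB/DEL pair used when only the final sample is sequenced."""
    del_pos = left_pos + 1                   # 305569 + 1 = 305570
    del_size = earlier_del_size - tsd_size   # 3817 - 9 = 3808

    def row(*fields):
        return "\t".join(str(field) for field in fields)

    return [
        # Target site duplication is recorded as 0, meaning unknown.
        row("MOB", first_id, ".", seq_id, left_pos, element, strand, 0),
        row("DEL", first_id + 1, ".", seq_id, del_pos, del_size,
            f"mediated={element}"),
    ]


pair = final_sample_annotation(305569, 3817, 9)
for line in pair:
    print(line)
```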
To be added!
To be added!