Tutorial Curation Editing GenomeDiff Files
As we mentioned, the underlying information displayed in the HTML output files is contained in the GenomeDiff output file generated by breseq. Let's look at this format in a bit more depth.
The GenomeDiff File Format page in the manual describes what you are seeing in depth, but here's a high-level description to orient you.
Here are some important parts of an example GenomeDiff file.
#=GENOME_DIFF 1.0
#=TITLE Ara-1_10000gen_4536A
#=AUTHOR Barrick JE
#=TIME 10000
#=POPULATION Ara-1
#=TREATMENT LTEE
#=CLONE A
#=MUTATOR_STATUS non-mutator
#=REFSEQ https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_1.fastq.gz
#=READSEQ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_2.fastq.gz
SNP	1	.	REL606	380188	C
INS	2	.	REL606	475292	G
DEL	3	.	REL606	547700	8224	mediated=IS1
SNP	4	.	REL606	649391	A
SNP	5	.	REL606	683496	C
MOB	6	.	REL606	969836	IS150	1	3
SNP	7	.	REL606	1329516	T
MOB	8	.	REL606	1544289	IS150	-1	3
MOB	9	.	REL606	1733647	IS150	-1	3
SNP	10	.	REL606	1976879	G
DEL	11	.	REL606	2031703	23293
SNP	12	.	REL606	2082685	A
SNP	13	.	REL606	2499315	A
SNP	14	.	REL606	3045069	T
Every GenomeDiff file has to begin with this header:
#=GENOME_DIFF 1.0
This tells any interpreter that it is reading a GenomeDiff file and which version of the format is used.
Other header lines have this form:
#=<setting> <value>
They can be used for storing "metadata". This information is read and output by various analysis commands.
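As a rough illustration of how these header lines can be read, here is a minimal Python sketch. It is not part of breseq or gdtools, and the function name parse_gd_header is hypothetical. It assumes that a setting name is separated from its value by whitespace and that a setting such as READSEQ may appear more than once, so every setting maps to a list of values.

```python
from collections import defaultdict


def parse_gd_header(text):
    """Collect '#=<setting> <value>' header lines into a dict of lists."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#=GENOME_DIFF"):
        raise ValueError("a GenomeDiff file must begin with the #=GENOME_DIFF header")
    metadata = defaultdict(list)
    for line in lines:
        if not line.startswith("#="):
            continue  # not a header line (for example, a mutation entry)
        # Split off the setting name; the value itself may contain spaces.
        parts = line[2:].split(None, 1)
        if not parts:
            continue  # a bare "#=" with no setting name
        value = parts[1].strip() if len(parts) > 1 else ""
        metadata[parts[0]].append(value)
    return dict(metadata)


# A few lines taken from the example file above.
example = "\n".join([
    "#=GENOME_DIFF 1.0",
    "#=TITLE Ara-1_10000gen_4536A",
    "#=AUTHOR Barrick JE",
    "#=READSEQ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_1.fastq.gz",
    "#=READSEQ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_2.fastq.gz",
    "SNP\t1\t.\tREL606\t380188\tC",
])
header = parse_gd_header(example)
print(header["TITLE"], len(header["READSEQ"]))
```

Keeping every value in a list is one possible design choice here: it means that repeated settings are not silently overwritten.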
After the header, the rest of a GenomeDiff file consists of lines that begin with a two- or three-letter code and then information separated by tabs. (It's very important that these are tabs and not spaces!)
The three-letter codes are for mutations, and the two-letter codes are for the evidence that breseq used to predict them. The second item in each line is a unique ID for that entry. For a mutation, the third item is a comma-separated list of the IDs of the evidence entries that support it. This is the evidence that will pop up if you click on the links for a mutation prediction in the HTML output. In edited GenomeDiff files like the one above, this list may be replaced by . as an empty value.
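The column layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how breseq itself parses these files: the names GDEntry and parse_gd_entry are hypothetical, the remaining columns are kept as plain strings rather than interpreted, and the evidence IDs 15 and 16 in the second example line are made up.

```python
from typing import List, NamedTuple


class GDEntry(NamedTuple):
    code: str                # e.g. SNP, DEL, MOB
    entry_id: str            # unique ID of this entry
    evidence_ids: List[str]  # IDs of supporting evidence; empty if "."
    fields: List[str]        # everything after the third column, uninterpreted

    @property
    def is_mutation(self):
        # Per the text above, three-letter codes mark mutations.
        return len(self.code) == 3


def parse_gd_entry(line):
    """Split one tab-separated GenomeDiff entry line into its parts."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 3:
        raise ValueError("expected at least 3 tab-separated columns: %r" % line)
    code, entry_id, evidence = columns[:3]
    evidence_ids = [] if evidence == "." else evidence.split(",")
    return GDEntry(code, entry_id, evidence_ids, columns[3:])


# A line from the example file, where the evidence list is the empty value "."
deletion = parse_gd_entry("DEL\t3\t.\tREL606\t547700\t8224\tmediated=IS1")
# A hypothetical unedited line that still lists two evidence IDs
snp = parse_gd_entry("SNP\t1\t15,16\tREL606\t380188\tC")
print(deletion.code, deletion.fields)
print(snp.evidence_ids)
```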
To be added.
GenomeDiff files are just text files, so you can edit them the same way that you would edit any other text file. You just have to be careful that your edits follow the expected syntax. Often you can figure this out by looking at and copying existing lines from other files. But remember that the GenomeDiff File Format specification is there for you, should you need to check anything or understand some of the more advanced ways of specifying complex mutations and series of mutations.
Note
You will want to turn on the option to "view invisible characters" in your text editor when working with GenomeDiff files. Otherwise it is hard to tell if whitespace is a tab or a space!
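If your editor cannot show invisible characters, a small script can flag suspicious lines instead. The following Python sketch is a rough check, not a replacement for a real validator. The function name find_whitespace_problems is hypothetical, and it relies on the assumption that the first three columns (code, ID, and evidence IDs) never contain spaces.

```python
def find_whitespace_problems(text):
    """Return (line_number, message) pairs for lines that look space-separated."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        columns = line.split("\t")
        if len(columns) < 3:
            problems.append((number, "fewer than 3 tab-separated columns"))
        elif any(" " in column for column in columns[:3]):
            # Assumption: code, ID, and evidence IDs never contain a space.
            problems.append((number, "space inside one of the first 3 columns"))
    return problems


sample = "\n".join([
    "#=GENOME_DIFF 1.0",
    "SNP\t1\t.\tREL606\t380188\tC",   # tabs: fine
    "SNP 4 . REL606 649391 A",        # spaces: should be flagged
    "INS 2\t.\tREL606\t475292\tG",    # mixed: should be flagged
])
problems = find_whitespace_problems(sample)
for number, message in problems:
    print("line %d: %s" % (number, message))
```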
breseq installs a separate utility command called gdtools (for GenomeDiff tools) for performing various operations on GenomeDiff files and analyses of sets of GenomeDiff files.
One simple command that can save you time reads in one or more GenomeDiff files and checks that there are no problems with their formatting.
gdtools VALIDATE -r run/data/reference.gff3 <input_to_validate.gd>
The -r reference file should be the same one that you used to run breseq. You could use the input file from your original breseq command. The command above instead uses the copy that breseq makes in GFF3 format in the data directory of its output, which is also suitable.
The gdtools utility has many other subcommands that can be useful for curating large sets of GenomeDiff files.
You can get a list of them by running gdtools with no arguments:
gdtools
Some useful ones for curation include:
COMPARE, SUBTRACT, MERGE, and REMOVE.
We'll cover using gdtools APPLY in the next section.
You can get the help for any of these subcommands by running the subcommand with no arguments. For example:
gdtools SUBTRACT
As an example of how these can be used, we work with a strain of Acinetobacter baylyi that has deletions of all of its transposons (ADP1-ISx). Rather than running breseq against the reference genome of this strain, we usually run it against the genome of the original ancestor (ADP1), which still has the transposons. This ensures that our mutations are predicted at the usual coordinates for this genome (rather than shifted by the deletions). However, this choice means that we re-predict the same transposon deletions as DEL entries in any genomes that we evolve from ADP1-ISx. Since they didn't occur during our evolution experiment, we don't want to count them.
We can generate a second set of GenomeDiff files in which these deletions are removed by running commands like this:
gdtools SUBTRACT -o evolved_with_no_transposon_deletions.gd evolved.gd ADP1-ISx.gd
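When there are many evolved genomes, the same command has to be repeated for each file. One way to do that is sketched below in Python. The script only builds and prints commands of the form shown above, and the helper run_all is provided for actually executing them. The function names, the evolved_1.gd and evolved_2.gd input names, and the subtracted output directory are all hypothetical.

```python
import shlex
import subprocess
from pathlib import Path


def subtract_command(evolved_gd, ancestor_gd, output_dir):
    """Build one 'gdtools SUBTRACT -o <output> <evolved> <ancestor>' command."""
    evolved_gd = Path(evolved_gd)
    # Write each result under output_dir, keeping the input file name.
    output = Path(output_dir) / evolved_gd.name
    return ["gdtools", "SUBTRACT", "-o", str(output), str(evolved_gd), str(ancestor_gd)]


def run_all(commands, output_dir):
    """Run the commands one by one; requires gdtools to be installed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in commands:
        subprocess.run(command, check=True)  # stop at the first failure


# Hypothetical input files; ADP1-ISx.gd is the file used in the command above.
evolved_files = ["evolved_1.gd", "evolved_2.gd"]
commands = [subtract_command(f, "ADP1-ISx.gd", "subtracted") for f in evolved_files]
for command in commands:
    print(shlex.join(command))  # dry run: show what would be executed
```

Calling run_all(commands, "subtracted") would then execute them. Printing the commands first makes it easy to check the file names before anything is written.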