Tutorial Curation Editing GenomeDiff Files

Jeffrey Barrick edited this page Jun 12, 2024 · 4 revisions

Back to the Main Curation Tutorial Page

As we mentioned, the underlying information displayed in the HTML output files is contained in the GenomeDiff output file generated by breseq. Let's look at this format in a bit more depth.

The GenomeDiff File Format page in the manual describes what you are seeing in depth, but here's a high-level description to orient you.

Here are some important parts of an example GenomeDiff file.

#=GENOME_DIFF	1.0
#=TITLE	Ara-1_10000gen_4536A
#=AUTHOR	Barrick JE
#=TIME	10000
#=POPULATION	Ara-1
#=TREATMENT	LTEE
#=CLONE	A
#=MUTATOR_STATUS	non-mutator
#=REFSEQ	https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_1.fastq.gz
#=READSEQ	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_2.fastq.gz
SNP	1	.	REL606	380188	C
INS	2	.	REL606	475292	G
DEL	3	.	REL606	547700	8224	mediated=IS1
SNP	4	.	REL606	649391	A
SNP	5	.	REL606	683496	C
MOB	6	.	REL606	969836	IS150	1	3
SNP	7	.	REL606	1329516	T
MOB	8	.	REL606	1544289	IS150	-1	3
MOB	9	.	REL606	1733647	IS150	-1	3
SNP	10	.	REL606	1976879	G
DEL	11	.	REL606	2031703	23293
SNP	12	.	REL606	2082685	A
SNP	13	.	REL606	2499315	A
SNP	14	.	REL606	3045069	T

Metadata Lines

Every GenomeDiff file has to begin with this header:

#=GENOME_DIFF	1.0

This tells any program reading the file that it is a GenomeDiff file and which version of the format it uses.

Other header lines have this form:

#=<setting> <value>

They store "metadata" about the sample. Various analysis commands read this information and include it in their output.
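To make the header layout concrete, here is a minimal Python sketch that collects the metadata lines from the example above into a dictionary. The function name parse_metadata is hypothetical and is not part of breseq or gdtools. It assumes each setting is separated from its value by whitespace, and it stores values in lists because a setting such as READSEQ can appear more than once.

```python
def parse_metadata(lines):
    """Collect '#=<setting> <value>' header lines into a dict of lists.

    Hypothetical helper for illustration; not part of breseq or gdtools.
    """
    metadata = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#="):
            continue  # not a metadata line
        # Split the setting name from its value at the first run of whitespace.
        parts = line[2:].split(None, 1)
        if not parts:
            continue  # a bare '#=' with nothing after it
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        # The same setting may repeat (for example READSEQ), so keep a list.
        metadata.setdefault(key, []).append(value)
    return metadata


header = [
    "#=GENOME_DIFF\t1.0",
    "#=TITLE\tAra-1_10000gen_4536A",
    "#=POPULATION\tAra-1",
    "#=READSEQ\tftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_1.fastq.gz",
    "#=READSEQ\tftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR030/SRR030255/SRR030255_2.fastq.gz",
    "SNP\t1\t.\tREL606\t380188\tC",
]
metadata = parse_metadata(header)
print(metadata["TITLE"])
print(len(metadata["READSEQ"]))
```

Keeping every value in a list is a design choice for this sketch, so that repeated settings are not silently overwritten.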

Mutation Lines (three-letter codes)

After the header, the rest of a GenomeDiff file consists of lines that begin with a two- or three-letter code and then information separated by tabs. (It's very important that these are tabs and not spaces!)

The three-letter codes are for mutations. The second item in each line is a unique ID for that entry. The third item is a comma-separated list of the IDs of other entries that provide evidence supporting that mutation. This is the evidence that will pop up if you click on the links for a mutation prediction in the HTML output. In edited GenomeDiff files like the one above, this list may be replaced by a single period (.), which stands for an empty value.
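The layout of the first three columns can be sketched in Python as follows. The function parse_entry and the dictionary keys it returns are hypothetical names chosen for this illustration. The meaning of the columns after the third depends on the entry type and is left uninterpreted here (see the GenomeDiff File Format page for those details). The second call uses made-up evidence IDs (34 and 35) only to show the comma-separated case.

```python
def parse_entry(line):
    """Split one tab-delimited GenomeDiff entry into its leading columns.

    Hypothetical helper for illustration; not part of breseq or gdtools.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated columns: %r" % line)
    code, entry_id, evidence = fields[0], fields[1], fields[2]
    # A lone period marks an empty evidence list.
    evidence_ids = [] if evidence == "." else evidence.split(",")
    return {
        "type": code,
        "id": entry_id,
        "evidence_ids": evidence_ids,
        "fields": fields[3:],  # type-specific columns, not interpreted here
    }


entry = parse_entry("DEL\t3\t.\tREL606\t547700\t8224\tmediated=IS1")
print(entry["type"], entry["id"], entry["evidence_ids"], entry["fields"])

# Made-up evidence IDs, to show a non-empty comma-separated list.
with_evidence = parse_entry("SNP\t1\t34,35\tREL606\t380188\tC")
print(with_evidence["evidence_ids"])
```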

Evidence Lines (two-letter codes)

To be added.

Editing GenomeDiff Files

GenomeDiff files are just text files, so you can edit them with any tool that you would use for other text files. You just have to be careful that your edits follow the expected syntax. Often you can figure this out by looking at and copying existing lines from other files. But remember that the GenomeDiff File Format specification is there for you, should you need to check anything or understand some of the more advanced ways of specifying complex mutations and series of mutations.

Note

You will want to turn on the option to "view invisible characters" in your text editor when working with GenomeDiff files. Otherwise it is hard to tell if whitespace is a tab or a space!
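If your editor cannot show invisible characters, a rough check like the following Python sketch can point out lines where spaces were probably typed instead of tabs. It is only a heuristic written for this illustration (the name find_suspect_lines is hypothetical). It assumes that every non-header line needs at least three tab-separated columns and that none of the first three columns contains a space. It is not a substitute for a real validation of the file.

```python
def find_suspect_lines(lines):
    """Return (line_number, text) pairs that look space-delimited.

    Heuristic sketch only: flags entries with fewer than three
    tab-separated columns or with a space inside the first three columns.
    """
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip() or text.startswith("#"):
            continue  # skip blank lines and header lines
        fields = text.split("\t")
        if len(fields) < 3 or any(" " in field for field in fields[:3]):
            suspects.append((number, text))
    return suspects


lines = [
    "#=GENOME_DIFF\t1.0",
    "SNP\t1\t.\tREL606\t380188\tC",   # correct: tabs throughout
    "SNP 4 . REL606 649391 A",        # spaces instead of tabs
    "INS\t2\t. REL606\t475292\tG",    # one tab replaced by a space
]
suspects = find_suspect_lines(lines)
for number, text in suspects:
    print("line %d looks space-delimited: %r" % (number, text))
```

Only the first three columns are checked because later columns, such as annotation values, could legitimately contain spaces.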

breseq installs a separate utility command called gdtools (for GenomeDiff tools) for performing various operations on GenomeDiff files and analyses of sets of GenomeDiff files.

One simple command that can save you time reads in one or more GenomeDiff files and checks them for formatting problems.

gdtools VALIDATE -r run/data/reference.gff3 <input_to_validate.gd>

The -r reference file should be the same as the one you used to run breseq. You could use the reference file from your original breseq command. The example here instead uses the copy that breseq writes in GFF3 format to the data directory of its output, which is also suitable.

Advanced gdtools commands for GenomeDiffs

The gdtools utility has many other subcommands that can be useful for curating large sets of GenomeDiff files.

To list the gdtools subcommands, run it with no arguments:

gdtools

Some useful ones for curation include:

COMPARE, SUBTRACT, MERGE, and REMOVE.

We'll cover using gdtools APPLY in the next section.

You can get the help for any of these subcommands by running the subcommand with no arguments. For example:

gdtools SUBTRACT

As an example of how these can be used, we work with a strain of Acinetobacter baylyi that has deletions of all of its transposons. Rather than running breseq against the reference genome of this strain (ADP1-ISx), we usually run it against the genome of the original ancestor (ADP1), which still has the transposons. This ensures that our mutations are predicted at the usual coordinates for this genome (rather than shifted by the deletions). However, this choice means that the same transposon deletions are predicted again as DEL entries in every genome that we evolve from ISx. Since they didn't occur during our evolution experiment, we don't want to count them.

We can generate a second set of GenomeDiff files in which they are removed by running commands like this:

gdtools SUBTRACT -o evolved_with_no_transposon_deletions.gd evolved.gd ADP1-ISx.gd
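To illustrate the idea behind this kind of subtraction, here is a small Python sketch that drops mutation entries from one list when a matching entry appears in another. It is not how gdtools SUBTRACT is implemented, and its matching rule is an assumption made for this example: two mutations match when their three-letter type and their positional columns agree, ignoring the ID, the evidence column, and any key=value columns. The function names and the ADP1 coordinates are made up for illustration.

```python
def is_mutation(line):
    """True if the line starts with a three-letter uppercase code."""
    code = line.split("\t", 1)[0]
    return len(code) == 3 and code.isalpha() and code.isupper()


def mutation_key(line):
    """Build a comparison key from the type and positional columns.

    Assumption for this sketch: IDs, evidence, and key=value columns
    are ignored when deciding whether two mutations are the same.
    """
    fields = line.rstrip("\n").split("\t")
    positional = [field for field in fields[3:] if "=" not in field]
    return (fields[0], *positional)


def subtract(evolved_lines, ancestor_lines):
    """Keep evolved lines unless they are mutations also in the ancestor."""
    ancestor_keys = {mutation_key(l) for l in ancestor_lines if is_mutation(l)}
    return [
        line for line in evolved_lines
        if not (is_mutation(line) and mutation_key(line) in ancestor_keys)
    ]


# Made-up entries: the coordinates are hypothetical, not real ADP1 positions.
evolved = [
    "#=GENOME_DIFF\t1.0",
    "DEL\t1\t.\tADP1\t1000\t1200",
    "SNP\t2\t.\tADP1\t5000\tT",
]
ancestor = [
    "#=GENOME_DIFF\t1.0",
    "DEL\t7\t.\tADP1\t1000\t1200",
]
result = subtract(evolved, ancestor)
print("\n".join(result))
```

In this example the DEL entry is removed even though its ID differs between the two files, while the header line and the SNP are kept.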

Next: Validating your predictions
