-
Notifications
You must be signed in to change notification settings - Fork 59
Standard Output
Pilon version 1.14 Sat Oct 31 14:30:00 2015 -0400
Genome: genome.fasta
Fixing bases, gaps, local, breaks
Pilon starts out by printing a version string, which for a normal release is something simple like "version 1.5". Pilon then logs the inputs genome being used and the types of fixes Pilon will attempt; "pilon --help" gives a pretty good description:
- "bases": try to fix individual bases and small indels;
- "amb": fix ambiguous bases in fasta output (to most likely alternative).
- "gaps": try to fill gaps;
- "local": try to detect and fix local misassemblies;
- "all": all of the above (default);
- "none": none of the above; new fasta file will not be written.
The following are experimental fix types:
-
"breaks": allow local reassembly to open new gaps (with "local").
-
"novel": assemble novel sequence from unaligned non-jump reads.
Input genome size: 4411708 Scanning BAMs C1A4BACXX.3.Solexa-126846.aligned.bam: 4651320 reads, 0 filtered, 3432664 mapped, 551837 proper, 662674 stray, FR 100% 2500+/-1224, max 6172 C1A4BACXX.2.Solexa-126846.aligned.bam: 4572660 reads, 0 filtered, 3380490 mapped, 2507108 proper, 662146 stray, FR 100% 2500+/-1225, max 6175 D1CLVACXX.1.Solexa-125092.aligned.bam: 9219486 reads, 1170850 filtered, 6693087 mapped, 5251563 proper, 1329064 stray, FR 100% 227+/-71, max 439
Pilon does an initial scan of the input BAM files to generate some statistics, such as the read pair orientation and the mean and standard deviation of the insert sizes. Unless the "--nostrays" option is used, Pilon will begin by building a map of "stray" read pairs, which are read pairs for which both reads align but are not marked as "proper pairs" in the BAM. These could be pairs with alignments of incorrect orientation or the wrong separation to be considered valid. This map is used to find additional reads during the local reassembly process. The stray map can consume quite a bit of memory, so "--nostrays" may be appropriate for large genomes (though it will make gaps filling, etc., less effective).
Processing gi|395136682|gb|CP003248.1|:1-4411708
As Pilon begins to process each region of the genome, it will print the coordinates of the region. For smaller genomes, this is normally a whole FASTA element in the input genome (scaffold, reference chromosome or plasmid). However, FASTA elements larger than 10Mb will be broken into chunks of <10Mb to be processed.
jumps C1A4BACXX.3.Solexa-126846.aligned.bam: coverage 42
jumps C1A4BACXX.2.Solexa-126846.aligned.bam: coverage 41
frags D1CLVACXX.1.Solexa-125092.aligned.bam: coverage 96
Total Reads: 14142168, Coverage: 179, minDepth: 18
Pilon then processes all the reads aligned to the region being processed, BAM by BAM, and builds pileups from them. During the process, it computes the coverage of "valid" reads, which are either aligned unpaired reads, or reads from aligned "proper pairs". The reported coverage is often considerably lower than the raw input coverage. In the TB example here, there are 100x of raw coverage in each jump BAM, and 200x in the frag BAM. Pilon also computes the mean insert size and standard deviation of the paried BAMs from the valid pairs (e.g., 226 mean, 71 standard deviation for the frags above) . The last line is the total valid coverage, along with the computed minimum depth used to call bases in the pileups (the default cutoff is 10% of the mean depth; see --mindepth in the help string).
Confirmed 4390463 of 4411708 bases (99.52%)
Starting with Pilon v1.4, it prints out the number and percentage of bases for which the input genome (refrence or starting assembly) is confirmed. If the input genome contains gaps (Ns), those bases are not counted as part of the total. Unconfirmed may be base changes or not callable due to low coverage.
Corrected 926 snps; 40 ambiguous bases; 28 small insertions totaling 99 bases; 32 small deletions totaling 110 bases
This is a tally of small event corrections which Pilon would make in constructing the output genome (pilon.fasta). Note that for ambiguous bases, this is the number which would be changed because the majority of evidence supports the change. It is not the total number of ambiguous bases found by Pilon, as some are ambiguous but with a minority of evidence supporting the change; to see those, you would have to look in the VCF. Also note that some of these changes may not appear as small events in the final output, as a larger event may swallow them.
# Attempting to fill gaps
If “gaps” is one of the --fix types specified (which it is by default), Pilon will attempt to fill gaps by recruiting read pairs for which one read is correctly aligned (size and orientation) to one of the gap flanks, as well as any unpaired reads which are partially aligned to a flank and extend into the gap.
fix gap: scaffold00001:82428-93547 82428 -0 +276 ClosedGap
When Pilon makes any change due to local re-assembly (gap filling or break fixing, see below), the results are always of this general form. The line starts with the type of fix: “fix gap” says this was a local reassembly was an attempt to fill a gap.
Next are the coordinates of the region in the input genome, in this case the gap was originally at 82428-93547 of the given input FASTA element. All Pilon genome coordinates are 1-based.
Next is a description of the change itself. Local reassembly results are always represented as (potentially) removing a block of sequence and (potentially) inserting a block of sequence at a coordinate: “82048 -0 +276” translates into “starting at coordinate 82048, remove zero bases and insert 276”; in other words, this is a pure insertion of sequence.
Next is a keyword classifying the type of fix. For gap filling, “ClosedGap” means the reassembly resulted in a closed (contiguous) solution from one flank to the other, completely filling the gap, with 276 bases in this case.
fix gap: scaffold00001:234634-234643 234632 -95 +0 ClosedGap
It is possible for a gap closure to involve removing rather than inserting sequence; in this case, Pilon determined that the original contig ends on either side of the gap overlapped by 95 bases in this case, so the closure only involved removing sequence.
fix gap: scaffold00001:1951717-1955581 1951717 -0 +2660 PartialFill
“PartialFill” is the tag given to gap closure attempts which result in gap end extension, but do not result in a full closure of the gap. In this case, Pilon was able to extend the gap ends by 2660 bases (left and right flank combined), but it wasn’t able to assembly completely across the gap. Pilon will extend the gap ends with the new sequence and leave a gap in the middle. In this case, the original gap was 3864 bases, and Pilon filled 2660, so it will insert the new sequence and leave reduced size gap of 1204 bases. If Pilon’s partial fill is larger than the estimate gap, it will leave a minimum size gap (default 10 Ns, but controlled by the --mingap option).
Pilon does not currently log gap closure attempts for which it is unable to extend the flanking contig ends, and it doesn’t bother doing a partial fill unless it adds at least 20 bases.
fix gap: scaffold00001:824708-829730 824650 -58 +462 PartialFill
Sometimes, a gap closure or partial fill will involve both removing and adding bases. This happens when, in the process of doing the assembly, Pilon found changes to one or both of the flanks, so it removed the sequence which was there and replaced it with the re-assembled block of sequence.
# Attempting to fix local continuity breaks
If "local" and/or "breaks" are among the fixes Pilon is attempting, it will look for evidence of breaks in local continuity and trigger a resassembly of the suspicious region. It moves out to the flanks of the suspicious region and finds reads anchored to the flanks, along with their paired mates. It builds an assembly graph from these reads and attempts to walk in from each flank, trying to find continuity across the problem area.
fix break: gi|395136682|gb|CP003248.1|:71573-71589 71584 -0 +37 BreakFix
The local reassembly for attempted fixes of contiguity breaks are similar in form to those for gap filling. The line starts with the type of fix: “fix break” says this was a local reassembly due to something which looked like a potential misassembly or break in contiguity from the input genome.
Next are the coordinates of the region which seemed to be suspicious in the first place, in this case 71573-71589 of the given input FASTA element. Note that this isn’t necessarily the region ultimately changed by the fix, it is the region which triggered the reassembly in the first place. Sometimes, the actual fix coordinate may be nearby but outside this region.
The fix summary “71584 -0 +37” translates into “starting at coordinate 71584, remove zero bases and insert 37”; in other words, this is a pure insertion of sequence.
Finally, there is a keyword token which describes the nature of the change Pilon made. “BreakFix” means it was a local reassembly which resulted in a closed solution different from what was originally in the input genome.
fix break: gi|395136682|gb|CP003248.1|:104769-104815 104764 -51 +52 BreakFix
In this case, we have another BreakFix, but the fix is: “starting at location 104764 in the input genome, remove 51 bases and replace them with 52 bases from the local reassembly”. The details can be found in the .changes and/or VCF files. It’s not uncommon for a block substitution like this to happen when there is a dense area of smaller changes, but together, they mess up the alignments in that area enough to trigger a local reassembly.
fix break: gi|395136682|gb|CP003248.1|:150638-150879 150894 -0 +86 OpenedGap
“OpenedGap” only happens when the “breaks” fix option is requested, which allows Pilon to open gaps when a local reassembly extends sequence from at least one of the gaps, but it cannot assemble a closure across the problem region. Typically this is enabled for variant calling, since large insertions are often only partially assembled. By default, we typically don’t enable this for assembly improvement, where we don’t routinely expect such large partial events. Here, Pilon was able to insert 86 bases starting at location 150894 and open a gap; from just the log, we can’t tell where the gap is (left, right, or somewhere in the middle of the 86 inserted bases); see VCF or .changes file for the details.
fix break: gi|395136682|gb|CP003248.1|:370928-370948 370935 -60 +0 BreakFix
The line above represents a pure deletion event of 60 bases; it is a “BreakFix”, so we know the reassembly found a closed solution, and it removes a block of bases without inserting anything.
fix break: gi|395136682|gb|CP003248.1|:580778-580800 580698 -103 +120 OpenedGap TandemRepeat 77
If the local reassembly encounters a tandem repeat of non-trivial size, it will report it at the end of the reassembly event line (new in Pilon v1.5, not yet released). The last number is the size of the tandem repeat found, 77 in this case. Tandem repeats are detected via loops in the assembly graph, and the default assembly Kmer size is 47. That means it won’t find TRs unless they have a total length (all consecutive copies) >47, and the reported length will always be the next multiple of the TR length >47. In this case, the actual TR length must be 77.
fix break: gi|395136682|gb|CP003248.1|:802416-802515 802504 -0 +73 OpenedGap TandemRepeat 54
In this case, the reported TR length is 54, but the actual repeat could actually be of size 54, 27, 18, or 9, as in each case, 54 would be the “next multiple of the size >47” (see description above).
# fix break: gi|395136682|gb|CP003248.1|:1096285-1096349 0 -0 +0 NoSolution
When Pilon prints log lines beginning with “#”, the intent is for them to be informational rather than a fixed format, machine-parsable format. Here we have a reassembly result with no fix coordinate and tagged as “NoSolution”. What this means is that a local reassembly was triggered surrounding the reported potential problem area (coords 1096285-1096349), but PIlon couldn’t assemble anything meaningful. In other words, Pilon couldn’t extend beyond either flank for more than a handful of bases. One reason this can happen is that the problem area requiring reassembly is actually larger than what Pilon detected. “NoSolutions” are logged this way as something an analyst might to look at manually; something is almost certainly wrong there, but Pilon couldn’t make it better.
Local reassemblies which result in a closed solution which confirms the input genome in the region (no changes) are not logged unless the --verbose options is specified, in which case they are logged as “NoChange”.
Writing gi|395136682|gb|CP003248.1| VCF to pilon.vcf
Writing gi|395136682|gb|CP003248.1| changes to pilon.changes
Writing updated gi|395136682|gb|CP003248.1|pilon to pilon.fasta
Finally Pilon logs the output files it is creating as it writes them.