Releases: broadinstitute/pilon
Pilon version 1.24
Pilon version 1.24 has one algorithm change which affects local reassembly solutions, particularly when using diploid data. Previously, if there were several equivalent solutions for a local reassembly (gap fill or continuity break fix), it would pick the first available solution. In v1.24, it picks the smallest change among equivalent alternatives, preventing it from including surrounding heterozygous (or ambiguous in the haploid case) SNPs in a larger changed block. This reduces both the size and number of BreakFix solutions in the output from diploid data, since many of them contained spurious differences related to heterozygosity.
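The selection change can be illustrated with a minimal Python sketch. This is not Pilon's code (Pilon is written in Scala); the `Solution` record, its fields, and the definition of "size" as the larger of the removed and inserted lengths are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """Hypothetical record for one candidate local reassembly fix."""
    start: int     # first reference coordinate affected
    ref_seq: str   # reference bases removed
    new_seq: str   # assembled bases inserted

    def size(self) -> int:
        # Assumed measure of "change size": the larger of removed vs. inserted bases.
        return max(len(self.ref_seq), len(self.new_seq))


def pick_first(solutions):
    # Behaviour described for versions before 1.24: take the first solution found.
    return solutions[0]


def pick_smallest(solutions):
    # Behaviour described for v1.24: prefer the smallest change.
    # min() returns the first minimum, so ties keep their original order.
    return min(solutions, key=lambda s: s.size())


candidates = [
    # Larger block that also rewrites a neighbouring (heterozygous) SNP.
    Solution(start=100, ref_seq="ACGTTGCAAT", new_seq="ACCTTGCAATGG"),
    # Minimal change: just the two inserted bases.
    Solution(start=110, ref_seq="", new_seq="GG"),
]

print(pick_first(candidates).size())     # size of the first candidate
print(pick_smallest(candidates).size())  # size of the smallest candidate
```

With the hypothetical candidates above, the older rule reports the 12-base block while the newer rule reports only the 2-base insertion.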
Support for the experimental --threads option has been removed. It was implemented by an ugly hack which no longer works in modern Scala, and it was a resource hog in any case. I hope to revisit multithreading again in the future.
Otherwise, v1.24 is a maintenance release, updating the code to compile on Scala 2.13 and updating the htsjdk library version to 2.23.0, allowing support for newer file formats such as CSI indexes. Additionally, v1.24 was built using the Java 11 toolchain, so I recommend using a JRE version 11 or greater.
Pilon version 1.23
Pilon version 1.23 introduces two new experimental arguments to specify long read input BAMs:
- --nanopore ont.bam identifies ont.bam as containing long reads from Oxford Nanopore sequencing
- --pacbio pb.bam identifies pb.bam as containing long reads from Pacific Biosciences sequencing
In this version, the long read BAMs are only used for SNP and indel calling based on pileups. Long reads are not yet used for local reassembly or gap filling, but that will likely come in a future release. For development, I have been using minimap2 to generate the long read BAM files.
Currently, use of long reads is most effective in combination with Illumina --frags libraries, so that Pilon can use the high base quality of the Illumina libraries for unique sequence and use the long reads to reach into repeat sequence to disambiguate embedded differences. It is possible to use only long reads as input to Pilon, but consider that very experimental.
There are limitations in Pilon's use of long reads in pileups: for both --pacbio and --nanopore libraries, Pilon does not attempt to call indels in homopolymer runs of 4 or more bases (e.g., AAAA...), and for --nanopore sequence, Pilon does not use the long reads to call the middle base of a CCxGG motif, as the ONT base calls can be confused by methylation. So this is very basic long read support, but it has been effectively applied to more than a dozen bacterial genomes in conjunction with Illumina paired-end sequencing.
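To make the two exclusions concrete, here is a small Python sketch that locates the affected positions in a reference sequence. It is an illustration only, not Pilon's implementation: the function names are hypothetical, coordinates are 0-based, and "x" in CCxGG is assumed to mean any of A, C, G or T.

```python
import re

# Runs of 4 or more identical bases.
HOMOPOLYMER = re.compile(r"([ACGT])\1{3,}")
# Lookahead so that overlapping motifs (e.g. in "CCCGGG") are all reported.
CCXGG = re.compile(r"(?=CC[ACGT]GG)")


def homopolymer_spans(seq):
    """Return 0-based, half-open (start, end) spans of homopolymer runs >= 4."""
    return [m.span() for m in HOMOPOLYMER.finditer(seq.upper())]


def ccxgg_middle_positions(seq):
    """Return 0-based positions of the middle base of each CCxGG motif."""
    return [m.start() + 2 for m in CCXGG.finditer(seq.upper())]


example = "ACGTAAAAACCAGGT"
print(homopolymer_spans(example))       # spans where long-read indels would be skipped
print(ccxgg_middle_positions(example))  # positions where nanopore base calls would be skipped
```

A caller could use these spans and positions as a mask when deciding whether long-read evidence at a pileup position should be considered.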
In addition to the long read support, v1.23 fixes a couple of bugs:
- Spurious long indels were occasionally called in pileups with minimal evidence
- A crash could occur when an indel was called at the beginning of a scaffold
Finally, this version updates the code base to use the Scala 2.12 compiler, uses a newer version of the htsjdk library, and is packaged with the sbt-assembly module instead of sbt-onejar.
Pilon version 1.22
This is a very minor release incorporating two bug fixes reported by users:
- Fixed bug in .bed file coordinates generated by the --tracks option (start coordinate was 1-based rather than 0-based);
- More flexibility in --targets specifications is now allowed so that scaffold names may contain colons (apparently Quiver generates these). Thanks to bwlang for sample code!
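Both fixes can be sketched in a few lines of Python. This is a hedged illustration rather than Pilon's code: the function names are hypothetical, and the target syntax is assumed to be a scaffold name optionally followed by `:start-end`. BED intervals are 0-based and half-open, so a 1-based inclusive interval keeps its end and has 1 subtracted from its start.

```python
import re


def to_bed_interval(start_1based, end_1based):
    """Convert a 1-based inclusive interval to BED's 0-based half-open form."""
    return (start_1based - 1, end_1based)


# Only a trailing ":<digits>-<digits>" is treated as a base range, so colons
# elsewhere in the name are preserved. The greedy ".*" backtracks to the last
# colon that is followed by such a range.
_TARGET_WITH_RANGE = re.compile(r"^(.*):(\d+)-(\d+)$")


def parse_target(spec):
    """Split a target spec into (scaffold_name, start, end); range may be None."""
    match = _TARGET_WITH_RANGE.match(spec)
    if match:
        return (match.group(1), int(match.group(2)), int(match.group(3)))
    return (spec, None, None)


print(to_bed_interval(1, 100))
print(parse_target("scaf:A:100-200"))
print(parse_target("scaf:A:B"))
```

One limitation of this sketch: a scaffold whose name itself ends in something like `:1-5` would be misread as a name plus a range.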
There should be no changes to results in this version other than the above.
--bruce, 15 Mar 2017
Pilon version 1.21
Version 1.21 introduces two new --fix options for assembly improvement: snps and indels. Prior to this release, the bases fix option was used to control alignment-based (as opposed to assembly-based) fixes of both SNPs and small indels. Now, fixing of SNPs and small indels can be controlled independently. The --fix bases option is retained for back-compatibility, and is equivalent to specifying both snps and indels.
For example, to use Illumina data to try to get rid of potential spurious indels in a PacBio assembly without changing anything else, one could use --fix indels.
This version also fixes an integer overflow bug which caused genome sizes > 2Gb to be printed as a negative size.
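The symptom is what one would expect if a total length were accumulated in a signed 32-bit integer, whose maximum is 2,147,483,647. The Python sketch below simulates that wraparound. The contig lengths are made up, and whether Pilon's bug arose exactly this way is an assumption.

```python
def as_int32(n):
    """Wrap an arbitrary integer to signed 32-bit two's-complement range."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


# Hypothetical contig lengths adding up to 3 Gb.
contig_lengths = [1_500_000_000, 1_500_000_000]

total32 = 0
for length in contig_lengths:
    # Simulates adding into a 32-bit accumulator.
    total32 = as_int32(total32 + length)

# Python integers are arbitrary precision, standing in for a 64-bit accumulator.
total64 = sum(contig_lengths)

print(total32)  # negative once the running total passes 2**31 - 1
print(total64)
```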
--bruce, 9 Dec 2016
Pilon version 1.20
This release only fixes a bug in the experimental PacBio long read circular element closure feature introduced in 1.19.
--bruce, 20 Sep 2016
Pilon version 1.19
Pilon version 1.19 includes a new experimental feature specifically for improving PacBio bacterial assemblies generated by HGAP/Falcon by identifying circular elements (chromosome or plasmids) and trimming them for circular continuity. If this new option is not used, 1.19 is identical in functionality to 1.18.
There is a new --fix option called circles, and it requires an aligned BAM of PacBio corrected long reads. For development purposes, I have been creating these using a command like:
blasr corrected.fasta submission.contigs.fasta -nproc <N> -sam -clipping soft -minPctIdentity 97
and then sorting and indexing the output BAM. Then Pilon can be called using this as an --unpaired library along with --fix circles. It can be combined with other options (e.g., --fix bases,circles might be a common thing to try), but the circles option will have no effect if you don't feed it an --unpaired corrected long read file.
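As a rough sketch of how the final Pilon call might be assembled from a script, the Python helper below builds (but does not run) an argument list. The jar path, file names and function name are placeholders chosen for illustration; only the --genome, --unpaired and --fix arguments are taken from Pilon's command line.

```python
def build_pilon_circles_command(genome_fasta, corrected_reads_bam,
                                fixes=("bases", "circles"),
                                pilon_jar="pilon.jar"):
    """Return an argv list for a Pilon run using --fix circles.

    All paths are placeholders; the sorted, indexed BAM of corrected
    long reads must already exist.
    """
    if "circles" not in fixes:
        raise ValueError("this helper is meant for runs that include 'circles'")
    return [
        "java", "-jar", pilon_jar,
        "--genome", genome_fasta,
        "--unpaired", corrected_reads_bam,  # circles needs an --unpaired long read BAM
        "--fix", ",".join(fixes),
    ]


cmd = build_pilon_circles_command("submission.contigs.fasta", "corrected.sorted.bam")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once the inputs are in place.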
Pilon uses the long read alignment information to look for potential circular structures, then re-assembles across the ends to ensure correct continuity. At this time, it makes no attempt to join multiple input scaffolds/contigs together to close a circle, it just trims or extends the ends of an existing element. Multiple Pilon users have reported that HGAP PacBio assemblies often have extra stuff off the ends of circular elements, and this attempts to fix that particular issue.
If Pilon thinks an element may be circular and attempts to close it, it will output something like:
Attempting to close circle
fix circle: contig000002 306915 ClosedCircle 1 -6372 +0 313288 -6090 +0 306916
The first number after the contig name indicates the estimated length before reassembly. ClosedCircle means it was successful, and then it prints the trimming changes it made as in other large fixes. In this case, it removed 6372 bases starting at coordinate 1 and 6090 bases starting at coordinate 313288, and the resulting length of the element is 306916. If it can't successfully re-assemble across the ends, it will print NoSolution as it does for other failed reassemblies.
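For anyone post-processing Pilon logs, a line like the one above could be parsed with a short Python sketch such as the following. The field layout (contig, estimated length, status, then coordinate / removed / added triples, then final length) is inferred from the single example shown here and is an assumption; other lines, including NoSolution ones, may be laid out differently.

```python
from typing import List, NamedTuple, Optional


class Edit(NamedTuple):
    coord: int    # coordinate where the change starts
    removed: int  # number of bases removed
    added: int    # number of bases added


class CircleFix(NamedTuple):
    contig: str
    estimated_length: int
    status: str
    edits: List[Edit]
    final_length: Optional[int]


def parse_fix_circle(line):
    """Parse a 'fix circle:' log line under the assumed layout."""
    prefix = "fix circle:"
    if not line.startswith(prefix):
        raise ValueError("not a 'fix circle:' line")
    fields = line[len(prefix):].split()
    contig, estimated, status = fields[0], int(fields[1]), fields[2]
    rest = fields[3:]
    edits = []
    # Consume (coord, -removed, +added) triples; a trailing single field is the length.
    while len(rest) >= 3:
        edits.append(Edit(int(rest[0]), -int(rest[1]), int(rest[2])))
        rest = rest[3:]
    final_length = int(rest[0]) if rest else None
    return CircleFix(contig, estimated, status, edits, final_length)


fix = parse_fix_circle(
    "fix circle: contig000002 306915 ClosedCircle 1 -6372 +0 313288 -6090 +0 306916"
)
print(fix.status, fix.final_length, sum(e.removed for e in fix.edits))
```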
Please keep in mind this is a first attempt using limited sample data, and I'm happy to try to make improvements based on experience any of you have. Please share your experiences on the pilon-users mailing list.
Pilon version 1.18
This version adds a new --iupac command line option which enables output of IUPAC ambiguous base codes in the output FASTA file. This will be most useful for diploid assembly improvement, allowing Pilon to include heterozygous SNP codes in the improved assembly FASTA. Pilon currently only makes two-way heterozygous calls (e.g., "C or T" is encoded "Y"), not 3-way.
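For reference, the six two-way IUPAC ambiguity codes can be captured in a small lookup. The Python sketch below is illustrative only; the function name is hypothetical and it does not reflect how Pilon encodes the calls internally.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC_TWO_WAY = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def het_code(base_a, base_b):
    """Return the IUPAC code for a two-way call; identical bases pass through."""
    a, b = base_a.upper(), base_b.upper()
    if a == b:
        return a
    # Raises KeyError for anything other than two distinct bases from ACGT.
    return IUPAC_TWO_WAY[frozenset((a, b))]


print(het_code("C", "T"))
```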
--bruce, 12 June 2016
Pilon version 1.17
This release implements two enhancements requested by users:
- A new optional argument --outdir <directory> which specifies a place for Pilon to put all its output files. If you use this option, the naming of the individual files doesn't change, just the location.
- In addition to --frags, --jumps, and --unpaired, Pilon now allows aligned reads to be simply specified as --bam <aligned-bam.bam>. If the --bam argument is used, Pilon will scan the BAM file (as it usually does anyway) to gather statistics about the orientation and insert size distribution of the reads in the library, and it will make its best determination as to whether they should be treated as small insert (fragment) pairs, large insert (jump) pairs, or unpaired. The heuristics are pretty simple: if the plurality of reads are unpaired, that's what it will use; otherwise, it determines whether most of the aligned read pairs are in FR or RF orientation, and uses the corresponding mean insert size to determine whether to consider them frags (< 700bp) or jumps (> 700bp). As always, separate libraries should each be in their own BAM file.
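The --bam heuristic described above can be written out as a short Python sketch. This is not Pilon's code: the function and parameter names are hypothetical, the counts are assumed to be directly comparable, and treating a mean insert of exactly 700 bp as a jump is an arbitrary choice, since the notes do not say how that boundary is handled.

```python
def classify_library(unpaired, fr_pairs, rf_pairs,
                     fr_mean_insert, rf_mean_insert, cutoff=700):
    """Guess a library type from BAM statistics: 'unpaired', 'frags' or 'jumps'."""
    # Plurality rule: unpaired wins if it outnumbers each paired category.
    if unpaired > fr_pairs and unpaired > rf_pairs:
        return "unpaired"
    # Otherwise use the mean insert size of the dominant pair orientation.
    mean_insert = fr_mean_insert if fr_pairs >= rf_pairs else rf_mean_insert
    return "frags" if mean_insert < cutoff else "jumps"


print(classify_library(10, 9000, 50, 350.0, 4000.0))   # mostly FR, small insert
print(classify_library(10, 200, 8000, 300.0, 3500.0))  # mostly RF, large insert
print(classify_library(5000, 20, 10, 0.0, 0.0))        # mostly unpaired
```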
Thanks to Chris Desjardins and Ashlee Earl for suggesting the automatic BAM type determination and to Torsten Seemann for suggesting the output directory argument.
Pilon version 1.16
This release fixes a bug introduced in v1.15 which caused Pilon to crash when using input BAM files containing unpaired reads (--unpaired). It is otherwise identical to v1.15.
--bruce, 7 Dec 2015
Pilon version 1.15
This release is aimed at improving performance and convenience for those applying Pilon to large genomes. There are no changes in functionality from v1.14, though on most of my test cases, v1.15 is 10-15% faster overall because of some efficiency improvements to finding reads for the local reassembly process.
More significantly, v1.15 contains an optimization for those who use the --targets argument to specify a subset of input scaffolds to process during a run in order to reduce memory requirements for handling large genomes. It is my hope that this will make it more viable to use the --targets argument to specify a file with a list of scaffolds to process from a large genome rather than having to split all the input files.
The first thing Pilon normally does is scan the BAMs to compute stats and create an in-memory data structure to hold any "stray" pairs, that is, pairs which are not mapped as "proper" pairs in the BAM. This is necessary to get easy access to mates which may be mapped far away in the input genome (and hence far away in the BAM file), as these faraway mates are prime candidates to be included in the local reassembly process. However, mapping stray pairs can be very memory intensive for large genomes, so starting with this version, Pilon will ignore any stray pairs for which neither read maps to any of the specified --targets scaffolds. This should increase speed and reduce memory for running a scaffold subset.
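The filtering rule can be sketched in Python as follows. It is an illustration under stated assumptions, not Pilon's implementation: the function name and the tuple representation of a pair are hypothetical, and an empty target set is assumed to mean "no restriction".

```python
def keep_stray_pair(read_scaffold, mate_scaffold, targets):
    """Decide whether a stray (non-proper) pair is worth holding in memory.

    A pair is kept if at least one of its reads maps to a target scaffold.
    """
    if not targets:
        return True  # assumed: no --targets given, so keep everything
    return read_scaffold in targets or mate_scaffold in targets


targets = {"scaffold00001", "scaffold00002"}
# Hypothetical stray pairs as (read scaffold, mate scaffold).
stray_pairs = [
    ("scaffold00001", "scaffold00090"),  # one end on a target: kept
    ("scaffold00050", "scaffold00090"),  # neither end on a target: dropped
    ("scaffold00077", "scaffold00002"),  # mate on a target: kept
]
kept = [pair for pair in stray_pairs if keep_stray_pair(pair[0], pair[1], targets)]
print(len(kept), "of", len(stray_pairs), "stray pairs kept")
```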