Modify the position of insertions: take it from the same read that supplies the insertion allele.
I propose changing how cuteSV calls insertions. As I understand it, cuteSV records the average position of the inserted sequences across the reads as the variant position, while the variant allele is the inserted sequence from one of the reads satisfying certain criteria. The inserted position and the inserted sequence are therefore not consistent, which makes the variant haplotype wrong. I suggest a small change: take the position and the inserted allele from the same read.
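To make the difference concrete, here is a minimal sketch of the two strategies. It is not cuteSV's code: the `ReadInsertion` type, the function names, and in particular the rule used to pick the representative read (insertion length closest to the cluster median) are assumptions made for illustration only.

```python
from statistics import mean, median
from typing import NamedTuple


class ReadInsertion(NamedTuple):
    """One insertion signature observed in a single read (hypothetical type)."""
    pos: int  # reference position of the base preceding the inserted bases
    seq: str  # inserted bases as seen in this read


def pick_representative(signatures):
    # Assumed selection rule, NOT cuteSV's actual criteria: take the read
    # whose insertion length is closest to the median length of the cluster.
    target = median(len(s.seq) for s in signatures)
    return min(signatures, key=lambda s: abs(len(s.seq) - target))


def call_averaged(signatures):
    # Position averaged over all reads, allele taken from one read:
    # the two values may not belong together.
    rep = pick_representative(signatures)
    return round(mean(s.pos for s in signatures)), rep.seq


def call_same_read(signatures):
    # Proposed behaviour: position and allele come from the same read.
    rep = pick_representative(signatures)
    return rep.pos, rep.seq


cluster = [
    ReadInsertion(100, "ACGT"),
    ReadInsertion(110, "ACGTA"),
    ReadInsertion(180, "ACGTAC"),
]
averaged_call = call_averaged(cluster)
same_read_call = call_same_read(cluster)
```

In this toy cluster both calls report the same allele, but only `call_same_read` reports the position at which that allele was actually observed.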
The attached example contains a 51-base heterozygous insertion in a tandem repeat; the inserted sequences in the reads are scattered across the 1:240699860-240700080 region.
cuteSV calls the insertion as follows:
1 240699952 cuteSV.INS.0 A AGAGTTTAAGACAAAATGGCAGCGGGGGCTGTAGATCTGGAAGCCATCTGTA
Here the position 240699952 is the average inserted position, and the 51-base variant allele is taken from one of the reads that have an insertion at 1:240699867. As a result, the reconstructed variant haplotype differs substantially from the consensus of the reads carrying the insertion:
where P1 is the variant haplotype and S1 is the read consensus.
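The effect of a shifted position on the reconstructed haplotype can be shown with a small sketch. The function `apply_insertion` and the toy reference below are hypothetical and are not taken from the attached example; the sketch only assumes the usual VCF convention that POS is 1-based and that REF is the anchor base preceding the inserted sequence.

```python
def apply_insertion(reference: str, pos: int, ref_allele: str, alt_allele: str) -> str:
    """Return the haplotype obtained by applying one VCF-style record.

    `pos` is 1-based, as in VCF; `reference` is assumed to start at position 1.
    """
    start = pos - 1
    end = start + len(ref_allele)
    # Guard against a record whose REF allele does not match the reference.
    if reference[start:end] != ref_allele:
        raise ValueError("REF allele does not match the reference at POS")
    return reference[:start] + alt_allele + reference[end:]


# Toy reference (made up for illustration).
toy_reference = "GATTACAGATCACA"

# The same 3 inserted bases ("TTG") placed at two different positions.
hap_at_observed_pos = apply_insertion(toy_reference, 3, "T", "TTTG")
hap_at_shifted_pos = apply_insertion(toy_reference, 10, "T", "TTTG")
```

Both haplotypes have the same length and carry the same inserted bases, yet they are different sequences, which is why an allele paired with a position it was not observed at can yield a haplotype that no read supports.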
The modified version of cuteSV calls the variant as follows:
1 240699867 cuteSV.INS.0 G GGAGTTTAAGACAAAATGGCAGCGGGGGCTGTAGATCTGGAAGCCATCTGTA
Here the position and the variant allele are extracted from the same read. The corresponding haplotype aligns perfectly with the consensus:
Comparing the haplotypes of a few thousand insertions called by cuteSV with the haplotypes of a high-confidence set of insertions shows that the haplotypes called by the modified version of cuteSV are much closer to the real haplotypes (haplotype similarity scores are closer to one):
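The exact similarity score used for this comparison is not defined in this description. As an illustration only, the sketch below assumes one plausible definition, one minus the edit distance divided by the length of the longer haplotype, so that identical haplotypes score 1.0. The names `edit_distance` and `haplotype_similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # deletion from a
                curr[j - 1] + 1,                    # insertion into a
                prev[j - 1] + (base_a != base_b),   # match or substitution
            ))
        prev = curr
    return prev[-1]


def haplotype_similarity(called: str, truth: str) -> float:
    # Assumed metric, not necessarily the one used for the comparison above.
    longest = max(len(called), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(called, truth) / longest
```

Under this assumed definition, a called haplotype that matches the truth exactly scores 1.0, and each mismatching, missing, or extra base lowers the score.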
insertion_example.zip