-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong position #43
Comments
Hello @Akazhiel, Thank you for your interest in cuteSV and use it. This is not a wrong INS call. Before explaining this issue, let me introduce the production of the insertion breakpoint briefly. cuteSV firstly extracts the INS signatures from the BAM file and generates the SIGS files. After that, cuteSV sorts the INS signatures by position and performs clustering. Finally, cuteSV determines the final INS from each cluster and calculates the genotypes. As you can see, the determination of the cluster is very important for producing the final INS call. cuteSV applies a parameter named max_cluster_bias_INS to decide which signatures will be taken into account. That is, the two INS signatures will be used for the same INS if their maximum distance less than the value. Usually, the default value of max_cluster_bias_INS is set as 100 for CLR/ONT and 1000 for CCS datasets respectively. From your example, there are lots of INS signatures in the above chr1:143202274. For the limitation of the range of the window of the IGV is only 41 bp, I believe there must be other INS signatures in the positions larger than chr1:143202274 but less than the max_cluster_bias_INS. As a result, cuteSV reputes these signatures as sources for the generation of the final INS chr1:143202282. Due to there are enough reads to support this INS which also has a great variant quality, cuteSV gives PRECISE tag and PASS filter finally. Also, the base T on the chr1:143202282 is treated as the reference allele. To solve this, you can decrease the max_cluster_bias_INS value to exclude few outliers. But this will sacrifice the recall especially for those calls whose signatures are extremely messy. Besides, there are indeed several SVs with messy signatures that have unintuitive breakpoints, which I guess are the calls you mentioned in EDIT at last. Hope this answer can help you solve your confusion! Best, |
Hello @tjiangHIT, I understand this max_cluster_bias_INS distance is measured in bp? If that is the case there are no other insertions in a range of 100bp from the one shown. That is why it seemed odd for cuteSV to place the INS in the wrong position when checking the alingments in the IGV. I think in this image it can be better appreciated how there are no other INS in the specified range. There's also the fact that the reference allele for the position reported by cuteSV is not a T but an A, or at least that's what the reference sequence says, it could be it says T since it's the complementary, I'm unsure as to how it handles all that. But, if you were to check the position at which the actual insertion happens which is at 143202274 you can observe that the reference allele in the sequence is actually a T. As for the EDIT if you check the picture in this post, on the rightmost position you can see a DEL labeled as PRECISE but nonetheless it doesn't span the whole supposed deletion and this supposed deletion can be observed in both BAM files yet it's only reported in one of them. Could this be due to these BAMs being subsets since I don't have enough computation power to run a whole sample and thus not having enough supporting reads to call the mutation accurately? Best regards, |
Hello @Akazhiel, cuteSV extracts INS signature not only from the intra-alignment (there is an over 50 bp I in cigar field), but also from the inter-alignment (with 2 or more split alignments). And those INS signatures from split-alignments might be hard to see from IGV. So could you supply the detailed INS signatures at the range of 1000 bp around chr1:143202274 from the file named INS.sigs? For the next point, I am sorry for my carelessness. Actually, cuteSV computes the final breakpoint of the INS from all signatures, and then it extracts the base from the reference genome on the location of breakpoint-1 as the final REF allele. As for the deletion, there are respectively 13 and 8 reads at least support this deletion from the second shot-screen of IGV. Please check the value of the parameter --min_support used in your practice. The default value of --min_support is 10 and this will ignore the deletion call with about 8 supporting reads. Best, |
Hello @tjiangHIT, Here it's attached the subset you asked for. Regarding the reference allele, I'm still not sure I understand it since if I'm not wrong the breakpoint-1 is the first position that is reported for that variant in the VCF file, which in this case is 143202282, but still the REF allele from the reference genome in that position is still not a T but an A, so it is wrongly reported. And for the --min-support I understand it now, but I still don't understand why the deletion variant didn't span the whole area that can be seen in the image but just a small part, I guess it has to be related to how the breakpoint is computed with the reads that support that variant. Best regards, |
I agree with @Akazhiel . Whatever limitation it can be, even it can be false positive SV call. But, the position and ref allele should not be different. In my case, I have noticed it in all variants irrespective of signatures, deletion or insertion. |
Hello @Akazhiel and @kpalanikannan, Thank you for sharing the INS signatures. As you can see the Table below, there are 12 INS signatures with bold POS used for generating the final INS. The average POS is 143202281. Considering the breakpoint in VCF is 1-based, cuteSV output 143202282 as the final breakpoint for this INS. Due to the insertion appearing at 143202282, cuteSV treats the base on 143202281 from reference genome as the reference allele.
Do you think the breakpoint and the REF allele of this INS should be 143202281 or 143202282? I have to say that this is a problem that has plagued me for a long time. Best, |
Hello @tjiangHIT, Thanks for the quick replies and always taking the time to explain everything in a clear way. Your message actually helped to understand why the reference allele in the VCF is a T and not an A since as you can see in the IGV the reference for position 143202282 is an A. So if I understood correctly, the base that is used as a REF in the VCF is actually the POS - 1 but still the REF position reported in the VCF is POS, in this case 143202282 and the end of the breakpoint is this same position. Maybe the REF position in the VCF should be the average POS that is returned by the pipeline and the end of the breakpoint be this same position since it's an insertion and it's also how it's reported in databases such as gnomAD. Nonetheless, I still have doubts regarding the INS signatures since in this table the positions are different, but if you check the IGV screenshots I attached previously, you can see how most insertions are located in the same place. What could be the reason of this difference? Maybe it's just me misunderstanding insertions since I'm new to all the structural variation field. |
Hello @Akazhiel, I have reviewed your first IGV shot screen, there are over 20 reads containing the INS signatures. But cuteSV only applied about 12 signatures, which are scattered than those in IGV, to finish the INS calling. I think the key point might be that several reads containing signatures are with low mapping quality or many numbers of split-alignments, and cuteSV will discard those reads that have less than 20 mapping quality or over 7 split parts. Best, |
Hello @tjiangHIT, Reading your answers I reach the conclusion that although you can see in the screenshots multiple insertions that are in the same position, this is not the position reported in the VCF file and that could be due to the reasons you have stated which is the possible low mapping quality and the split parts. How can I check the split parts in my alignment? Best regards, |
Hello @Akazhiel , First, you can have a look at the alignment file and in the 18th column, there will be a string started with 'SA' which represents the supplementary alignments. In this field, you can check the split parts of this read. Best, |
Hello,
I've noticed that some variants are reported as PRECISE and with a PASS filter but when taking a look at the alignment in the IGV the position reported is wrong in comparison to the actual precision of the structural variant.
As it can be observed the VCF file reports an insertion at position 143202282 in chromosome 1, while the actual position that can be observed is at 143202274 at chromosome 1. There's also the fact that the VCF states the reference is a T while in the position reported is an A, nonetheless in the actual position of the insertion the reference is indeed a T. Why could this be happening?
EDIT: I'd also like to mention there are variants reported that don't actually appear in the BAM files.
Best regards,
Jonatan
The text was updated successfully, but these errors were encountered: