k-mer peak estimation from splitHaplotypes can pick a bad peak #2189

ASLeonard · 2022-12-28T15:07:39Z

Hi,
I was doing some triobinning , and noticed hap1 had 0 reads while hap2 had the vast majority. As it turned out, the estimated informative k-mer peak for hap1 was at 1018, while for hap2 it was 5.

I did some digging and it was due to this line

canu/src/haplotyping/splitHaplotype.C

Line 380 in f29343c

    
           if (2 * minAve * aveSize < thisSum)   //  Over estimates the minimum sum when thisLen < aveSize - i.e., for

where for my k-mer distribution that ratio of thisSum/(minAve+aveSize) peaked at 1.863 at minFreq=5. So it is fairly easy to set this manually for splitHaplotype, but I was wondering if there is a more lenient condition or at least some fallback, because nearly picking minFreq=5 but then actually picking minFreq=1018 is a huge difference and filters out effectively most k-mers. This also finished the for loop so never actually satisfied the break condition.

Not very rigorous, but I could recover minFreq=5 (which isn't necessarily the "right" answer) by doing something like

double  maxRatio=0;
uint32 maxRatioFreq=1;


if (thisSum/(minAve*aveSize)>maxRatio) {
  maxRatio=thisSum/(minAve*aveSize);
  maxRatioFreq=minFreq;
}
...
// if haven't hit the break condition but exited the for loop
if (f==histoLen) minFreq=maxRatioFreq;

The text was updated successfully, but these errors were encountered:

skoren · 2022-12-28T15:43:56Z

This can happen with very low coverage samples w/o a clear peak. What do your histograms look like and how much coverage do you expect of the parents?

ASLeonard · 2022-12-28T15:58:41Z

Expected coverage is around 17x on a ~2.7 Gb genome for hap1 (male), closer to 19x for hap2 (female). Here are the first 30 lines from the histogram for both.

freq       hap1          hap2
1	2752020922	2598297790
2	116826104	88764180
3	52226425	27557240
4	49388283	27605726
5	52549830	36761304
6	58150638	50724134
7	65051710	68644213
8	72499198	89785046
9	80073731	112665134
10	87184644	134932322
11	93499245	153651706
12	98677390	166085744
13	102541267	170537450
14	104704033	166853575
15	105060039	155790544
16	103700640	139116871
17	100616631	119072377
18	95977026	98054646
19	90045410	77907581
20	83094461	59934248
21	75458877	44809682
22	67448850	32719869
23	59377815	23486298
24	51500014	16716901
25	43964249	11944480
26	36987843	8681586
27	30698131	6495435
28	25131305	5045836
29	20335780	4085464
30	16264130	3444988

…pe. Issue #2189.

skoren assigned brianwalenz Sep 1, 2023

brianwalenz added a commit that referenced this issue Jul 21, 2024

Use more reliable slope-based histogram min detection in splitHaploty…

2aa139e

…pe. Issue #2189.

brianwalenz closed this as completed Jul 21, 2024

skoren mentioned this issue Sep 30, 2024

triocanu results: a large number of reads were undistinguishable from haplotypes #2345

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

k-mer peak estimation from splitHaplotypes can pick a bad peak #2189

k-mer peak estimation from splitHaplotypes can pick a bad peak #2189

ASLeonard commented Dec 28, 2022

skoren commented Dec 28, 2022

ASLeonard commented Dec 28, 2022

k-mer peak estimation from splitHaplotypes can pick a bad peak #2189

k-mer peak estimation from splitHaplotypes can pick a bad peak #2189

Comments

ASLeonard commented Dec 28, 2022

skoren commented Dec 28, 2022

ASLeonard commented Dec 28, 2022