Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k-mer peak estimation from splitHaplotypes can pick a bad peak #2189

Closed
ASLeonard opened this issue Dec 28, 2022 · 2 comments
Closed

k-mer peak estimation from splitHaplotypes can pick a bad peak #2189

ASLeonard opened this issue Dec 28, 2022 · 2 comments
Assignees

Comments

@ASLeonard
Copy link

Hi,
I was doing some triobinning , and noticed hap1 had 0 reads while hap2 had the vast majority. As it turned out, the estimated informative k-mer peak for hap1 was at 1018, while for hap2 it was 5.

I did some digging and it was due to this line

if (2 * minAve * aveSize < thisSum) // Over estimates the minimum sum when thisLen < aveSize - i.e., for

where for my k-mer distribution that ratio of thisSum/(minAve+aveSize) peaked at 1.863 at minFreq=5. So it is fairly easy to set this manually for splitHaplotype, but I was wondering if there is a more lenient condition or at least some fallback, because nearly picking minFreq=5 but then actually picking minFreq=1018 is a huge difference and filters out effectively most k-mers. This also finished the for loop so never actually satisfied the break condition.

Not very rigorous, but I could recover minFreq=5 (which isn't necessarily the "right" answer) by doing something like

double  maxRatio=0;
uint32 maxRatioFreq=1;


if (thisSum/(minAve*aveSize)>maxRatio) {
  maxRatio=thisSum/(minAve*aveSize);
  maxRatioFreq=minFreq;
}
...
// if haven't hit the break condition but exited the for loop
if (f==histoLen) minFreq=maxRatioFreq;
@skoren
Copy link
Member

skoren commented Dec 28, 2022

This can happen with very low coverage samples w/o a clear peak. What do your histograms look like and how much coverage do you expect of the parents?

@ASLeonard
Copy link
Author

Expected coverage is around 17x on a ~2.7 Gb genome for hap1 (male), closer to 19x for hap2 (female). Here are the first 30 lines from the histogram for both.

freq       hap1          hap2
1	2752020922	2598297790
2	116826104	88764180
3	52226425	27557240
4	49388283	27605726
5	52549830	36761304
6	58150638	50724134
7	65051710	68644213
8	72499198	89785046
9	80073731	112665134
10	87184644	134932322
11	93499245	153651706
12	98677390	166085744
13	102541267	170537450
14	104704033	166853575
15	105060039	155790544
16	103700640	139116871
17	100616631	119072377
18	95977026	98054646
19	90045410	77907581
20	83094461	59934248
21	75458877	44809682
22	67448850	32719869
23	59377815	23486298
24	51500014	16716901
25	43964249	11944480
26	36987843	8681586
27	30698131	6495435
28	25131305	5045836
29	20335780	4085464
30	16264130	3444988

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants