-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retain all source IDs on VariantContext merge + test #8752
base: master
Are you sure you want to change the base?
Conversation
In GATK3, when merging variants, the IDs of all the source VCFs were retained. This code path seems like it intended that, since the variantSources set is generated, but it doesnt get used for anything. This PR will use that set to set the source of the resulting merged VC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson great, thanks for adding the test
@lbergelson: this might be permission-related, but despite entering an approving review github is asking for another. I assume that's b/c I dont have write permissions in this project? |
@bbimber Hmn, yeah, think it needs someone who has direct write access. I'll get a thumb from a teammate. Thanks for looking at it and for the pr! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson A question for you
@@ -1217,7 +1217,10 @@ public static VariantContext simpleMerge(final Collection<VariantContext> unsort | |||
|
|||
final String ID = rsIDs.isEmpty() ? VCFConstants.EMPTY_ID_FIELD : Utils.join(",", rsIDs); | |||
|
|||
final VariantContextBuilder builder = new VariantContextBuilder().source(name).id(ID); | |||
// This preserves the GATK3-like behavior of reporting all sources, delimited with hyphen: | |||
final String allSources = variantSources.isEmpty() ? name : variantSources.stream().sorted().collect(Collectors.joining("-")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we are merging a large number (say, thousands) of VCF records? Do we still want a concatenated list of the source names in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question. Apparently this is what GATK3 did. @bbimber Any objection to capping it at some reasonable length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I dont have personal use-cases around this, but on DISCVR-seq (which ported GATK3's CombineVariants) I get a surprising amount of requests from people around the lack-of or changes to this feature.
I have no real objection to some kind of cap on length. What length would you find reasonable? I suppose that value could be parameterized, but this would require exposing an argument to simpleMerge() and maybe overloading that method.
Anyway, this feels like a valid but fringe case to me. I dont have strong opinions one way or another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combining hundreds or thousands of VCF records is something we actually do frequently in practice. I think a parameterized option would be overkill, but some kind of sensible hardcoded max length for the source field seems appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. In practice, our source names are small, but if you're talking about 1000+ VCFs then should this value be on the order of 500 characters? 1000? 5k?
In your use case, it actually might be more valuable to have a boolean that prevents storing this value (whatever the length), since it might blow up the size of the output VCF. this might argue for overloading the method with this behavior being opt-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson I defer to you to do something reasonable in this PR to address the effects of this change on VCF file size for large cohorts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, after thinking O/N on this I would probably argue for overloading the method with one that adds arguments for "storeAllVcfSources", and "maxSourceLength", but keep the existing signature where storeAllVcfSources=False, and maxSourceLength=-1 (which is moot since the existing behavior of taking the first source would remian). this preserves the status quo, but callers could opt-in to GATK3-like behavior. I am happy to make those changes.
Hi @lbergelson and @droazen: it looks like @lbergelson added a limit to avoid excessive sources. Does that satisfy issues remaining on this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson Back to you with another concern
final VariantContextBuilder builder = new VariantContextBuilder().source(name).id(ID); | ||
// This preserves the GATK3-like behavior of reporting multiple sources, delimited with hyphen: | ||
final String allSources = variantSources.isEmpty() ? name : variantSources.stream() | ||
.sorted() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned about doing a sort on the source names for every single merged VCF record we're writing -- with thousands or tens of thousands of samples, that would get expensive!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's reasonable - what about what I proposed a little earlier in this thread and overloading the method to make the existing default case unchanged (list the first source), but have the ability for code to opt-in to the behavior of tracking all sources? I am happy to write that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bbimber That option is looking more attractive -- if you'd like to implement it, please go ahead and tag us for review. If we don't touch the default behavior this will be much easier to get in!
Hey @lbergelson, @droazen or @jamesemery: I'm hoping we could finalize this issue. I am unable to commit directly to this branch. To respond to the suggestions from @droazen I made this PR which i am hoping someone can merge into this feature branch (#8871). To recap:
|
Hey @lbergelson, @droazen or @jamesemery: sorry to keep bugging you all here, but I'm not sure what else to do here. I'm hoping we can finalize and close out this feature, or identify what is needed to do so. I recapped the state of this PR in my post above. I could make a clean feature branch with everything in one place if you want; however, if someone with write permissions could simply approve (#8871) , merge into this branch, and kick off tests that is all I need right now. |
@bbimber I've added a test to your pr #8750. Should be good to merge if tests pass.