-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intermediate VCF, decision sub-types (BK) #4
Comments
Ultimately, I think the usefulness of this field would depend on the application. The reason this field is in here is to support quantifying cases where we get the alleles right but the genotypes wrong. I think this is probably more common than actually having different representations and has been useful for us when developing models for genotyping. The decision of which BK values to use/support would be up to the comparison engine -- using only cm in vcfeval would probably be ok, too. |
BK=miss would indicate that the call is present in only one of the datasets. I think in terms of vcfeval, this can be directly extracted from the truth / query decision types: Anything that is not TP in truth or in query could have BK=miss. |
Can we include a more precise definition of what each of the possible values means? Or perhaps it would make sense to say that it is up to the comparison engine to define/use its own set of distinct sub-types, and that the downstream quantification stage will perform grouping by whatever subtypes are actually used? The latter approach would also allow optional sub-typing of the 'N' values (why did the engine decide not to assign a truth status). |
I think optional sub-typing of the N values is a good idea, different comparison engines will have different reason to not compare certain variants. I guess for the current types, we could supply examples: Direct match (m):
Complex match (cm): Simple left-shifting case (more complicated things are possible):
GT mismatch (gtmm): Direct genotype mismatch, no other variants interfering in superlocus.
Allele mismatch (almm): Direct allele mismatch, no other variants interfering in superlocus.
Missing variants (miss): Variant present in only one input dataset.
Complex mismatch (pm): One goal here would be to highlight / stratify out difficult regions.
Do these make sense? |
So, the important thing is what is the quantification stage going to do with the sub-types (if anything), and are the sub-types expected to be comparable across the comparison engines. My gut feel is that it would be hard to have the sub-types be exactly comparable across engines, as each engine has different notions of which distinctions are natural to calculate. (@bioinformed It would be good to hear what categorizations are natural to vgraph) For example, vcfeval doesn't distinguish between direct and complex matches, so flagging all its matches as one or other wouldn't be strictly correct according to the current m/cm sub-type definitions. It would be easy to identify "fairly-non-complex" matches based on the number of variants within the vcfeval sync region, yet not natural for xcmp to compute. Similarly, the vcfeval notion of common-allele mismatches is similar to, but not quite the same as gtmm (any gtmm would fall into this category, plus non-direct versions) Some examples of useful vcfeval sub-types for matches and non-matches: Match: This example is the same as your complex mismatch example (I am not sure why you labelled those TP as
Simple match: No other variants interfering in superlocus, but there may be representation differences (not implemented, so these would currently just appear as the default match sub-type).
The utility of this sub-type would be to allow easy separation of the quite complicated matches for investigation (alternatively this could be just done by superlocus postprocessing). Mismatch: The default FN/FP category.
Common-allele mismatch: There is a path which has a common allele. Includes zygosity errors (independent of representation).
Highlighting / stratifying out difficult regions would be done by looking the the number of variants sharing the same superlocus id. The point is that there isn't really a 1:1 correspondence between the xcmp and vcfeval sub-types, so I think that allowing the BK sub-types to be engine-defined seems a better way to go. |
I think it is probably important to standardize some of these BK sub-types as much as possible in order to allow the counting step to output our standardized definitions of performance metrics (e.g., TP, FP, FN, etc) at different stringencies (see the different comparison methods at https://github.com/ga4gh/benchmarking-tools/blob/master/doc/standards/GA4GHBenchmarkingPerformanceMetricsDefinitions.md). While standardizing complex match and complex mismatch probably aren't needed for this, standard definitions for GT mismatch (gtmm) and Allele mismatch (almm) are likely needed for some of the comparison methods (in particular, #1 and #2). I think Peter's definitions fit these comparison methods pretty well (if for #1 the "region" is +-0bp). If a tool isn't able to fit these definitions, then perhaps it shouldn't output gtmm or almm? Alternatively, we could modify the comparison method definitions if we think they aren't useful. |
My interpretation of the comparison methods in the document (and indeed the goals of the project) was that they should be independent of the variant representation. It should therefore not matter whether a match is direct vs non-direct (whether that be for a allele-only match or a genotype match). Distinguishing on the basis of direct vs non-direct only serves to provide a measure of the degree to which the same representational choices were made. In fact, as long as the engine outputs an overall status as genotype-match / Allele-match (ideally indicating which allele was the "common" allele, to allow disambiguating pure zygosity errors vs multiallelic cases) / Mismatch, then the determination of whether there is a "direct" match seems to be a syntactic notion that could be determined during quantification (should you want to). (This would avoid needing multiple, possibly conflicting, implementations). Perhaps the engine should output sub-types (corresponding to those in the comparison document) such as: gtmatch / almatch / miss. |
I agree that the direct/indirect distinction is not really important, and In terms of subtypes, in addition to gtmatch, almatch, and miss, I think we On Thu, Nov 19, 2015 at 6:24 PM, Len Trigg notifications@github.com wrote:
|
I agree that the direct / indirect match category is something that could also live in a separate optional field -- BT comes to mind? This would also clarify #3 . I do think it is useful to keep track of the cases which have a different representation, mainly because convincing people that different representations are an important thing to handle is also one of our goals. |
Justin said in #8:
I quite like the positive category labels, as then the BK seems more consistent across the different comparison types (in fact the quantification breakdowns can be probably even be derived from the BK depending on whether you want to do produce statistics according to method 1/2/3). It seems quite natural to assign:
I have a test branch which implements most of this if you want to try it (doesn't support loosematch, or splitting records to handle the multi-BS edge case mentioned in #5). Use --output-mode ga4gh to produce the new format output file. |
I like positive categories and the fact that they are easy to explain for TPs. I think the main application of negative classification would be stratifying FPs. E.g. looking at het/hom mismatches or allele mismatches (e.g. deletions/insertions of slightly different length) is useful. |
Maybe to make some progress on this, we could split this into two fields: A BK field which says either
We can introduce a separate field BX (benchmarking extended match information?) which can then (optionally, depending on the benchmarking type / engine) be used to encode things like
|
Thanks for keeping this moving. Does having 2 fields make it easier to do counting? As far as I understand, BX is just more specific than BK, and the BK status could be inferred from BX. Is that right? Also, wouldn't TP's always be gtmatch? |
I guess we could construct more complicated het / hetalt / phasing examples where GTs for TPs can be slightly different. The advantage I can think of for doing two fields rather than one is that it would be easier to transparently describe how they affect different parts of the counting procedure:
|
Regarding the BK field, I had assumed that the surrounding framework would take care of that (e.g. by pre-filtering any calls outside the truth confident regions before passing to the engine), and I think it's good to keep such filtering separate from the matching semantics as much as possible. The BX field sounds kind of unwieldy, as it seems like it would become a multi-valued field (e.g. a call could be both a gtmismatch but an almatch, and possibly other values too). With the superset approach of loose > almatch > gtmatch only one value is needed, and this seems to be the primary categorisation in the comparison document. I agree that additional annotations for subcategories like STR calls could be useful though. |
I think the superset approach of loose > almatch > gtmatch would simplify In terms of prefiltering, my one concern with this is where part of a On Wed, Dec 16, 2015 at 2:07 PM Len Trigg notifications@github.com wrote:
|
Using this approach, maybe we could define BK to have the following values and do without BX: missing; lose match [there is a variant nearby] > approx. almatch > almatch > gtmatch > hapmatch [haplotypes can be resolved as the same] We could then define TP/FP/FN on top of this for different comparison methods as follows: Lose comparison: TP: BK in {lose match / approx / almatch / gtmatch / hapmatch} Exact allele comparison: TP: BK in {almatch / gtmatch / hapmatch} Exact GT comparison: TP: BK in {gtmatch / hapmatch} W.r.t. prefiltering I also agree with @jzook -- the main reasoning for doing the UNK regions after comparison in hap.py is that this allows us to be more accurate at the boundaries of confident regions -- when variants have different representations in the query, they might not necessary fall into the confident call regions, so filtering before comparison would wrongly give FNs. |
@pkrusche I am unsure of the semantics of the additional BK values you've added (particularly loose vs approx, and gtmatch vs hapmatch), can we tighten/define them? I think this is what you mean, please correct if I'm wrong:
(Previously I was using "gtmatch" to correspond to a representation independent diploid haplotype match). So gtmatch seems to be a more specific match than hapmatch in that the haplotypes need to match and the same representation be used? Or are you thinking they would be done independently according to different matching techniques? (and since vcfeval only matches haplotypes, it would output almatch/hapmatch, never gtmatch) I'll open a separate issue for discussion around confident regions. |
I agree with @Lenbok that the way he's defined gtmatch doesn't make a lot of sense. I feel like that's getting back into the direct/indirect comparison which we've agreed isn't very useful. Also, I'm guessing that @pkrusche intended the approx category to address things like STR's (@pkrusche correct me if I'm wrong). I agree that's useful information, but since it's not included in any of our comparison methods I'm not sure it belongs in BK. I would propose putting that information into a separate optional (since not all engines will compute it) field. Unless we all think we should add an additional comparison method that addresses STRs? Anyway, I would propose the following definitions:
Also, stop me if this is a terrible idea (and it may be since different comparison engines will define superloci differently), but it seems like a in our comparison method #1 we could define a loose match as "there is a variant in both the truth and the query within this superlocus" rather than an arbitrary 50bp (or whatever length) threshold. |
Yes, these definitions look good to me. I also agree that leaving STRs for later and linking loose matches to superloci seem like good ideas. |
I also think the definitions proposed by @RebeccaTruty make sense, so it sounds like everyone may agree on these? I'm not sure I like the idea of linking loose matches to superloci because I suspect that it will be hard for people to interpret what the output means, especially when each tool has a different, complicated, definition for superloci. 50bp is arbitrary though, so it's likely useful for 50bp to be a parameter that could be changed. Right now, the hapdip tool from @lh3 is the only tool I know that does loose matching, and it allows the user to specify the "looseness". |
Perhaps we shouldn't drop the x-bp version, but if it's easy to compute a superloci-match (I suspect it is) then maybe we can add it? But I would consider it low priority. |
I'm fine with keeping the superloci match as well. How about these as definitions:
|
+1 |
Since superloci can vary in size (potentially depending both on the algorithm and the complexity of the region) it's not true that the superlocus match category is either more general or more specific than loose match. Thus you can no longer get away with supplying a single BK value. Rather than making BK be multi-valued, I would opt for simplicity by leaving out the superlocus match unless there you feel there is a clear need for it. |
This is a good point though I was not thinking of making it multivalued. I
|
Good point Len, I hadn't thought of that. In that case I agree with Justin that we can just drop the superlocus match. |
I added some examples in #12 -- please have a look and see if this makes sense -- here is a link to the updated file: https://github.com/ga4gh/benchmarking-tools/blob/issue-4-clarify-bk/doc/ref-impl/intermediate.md |
Looks good to me. Thanks Peter! |
For decision subtype, I am not clear what values to use for vcfeval results, as the terminology seems to be more oriented toward calls that have been through normalization.
I would probably put any TP as complex match since vcfeval doesn't have any notion of a "direct" match. I guess we could try to determine after-the-fact consider direct to be where the GT fields are the same, although this seems a bit pointless (and would have to be extra careful to handle cases where the input VCFs had multiple records at the same position). Slightly more natural (and useful) would be to count the number of TRUTH/QUERY variants in the sync region and take any 1:1 cases as "direct", even though they may actually be at different coordinates.
For non-TP the vcfeval semantics are really "this variant could not be included in a matching haplotype pair", which I would probably denote as complex mismatch. Our latest version can automatically run haploid matching on the FN/FP in order to find calls that have a common allele (e.g. typically matching calls that differ due to zygosity errors or half-calls), the closest semantics for any of these "haploid TP" here may be genotype mismatch?
I'm also not clear what BK=miss means (does it just mean that the call in this cell is '.'?)
The text was updated successfully, but these errors were encountered: