-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue a warning if the number of alt alleles exceeds the maximum specified #620
Conversation
7cf8b97
to
439ed89
Compare
Codecov Report
@@ Coverage Diff @@
## main #620 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 35 35
Lines 2815 2829 +14
=========================================
+ Hits 2815 2829 +14
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment about marking subsequent alleles as missing data - I think this would be easier to document and lead to fewer bugs in the long run.
sgkit/io/vcf/vcf_reader.py
Outdated
@@ -640,7 +652,8 @@ def vcf_to_zarr( | |||
specified ploidy will raise an exception. | |||
max_alt_alleles | |||
The (maximum) number of alternate alleles in the VCF file. Any records with more than | |||
this number of alternate alleles will have the extra alleles dropped. | |||
this number of alternate alleles will have the extra alleles dropped (the `variant_allele` | |||
variable will be truncated). Call genotype fields will however be unaffected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in this case the user may get an index error if they use alleles[genotypes[..]]
, right? It might not be obvious to people that this just means they won't be able to find out the allelic value that corresponds to the genotype value.
Code may break downstream in unexpected ways because of this as we violating a basic invariant of the data model. Maybe it would be better to mark subsequent alleles as missing data (-1)? At least well-written code would deal with this in a predictable way, whereas having to deal with arbitrary missing alleles at every step could be quite tedious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on this @eric-czech?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 handling call data. I wonder if there should be a flag to handle the alt allele overflow - with options: error
, turncate
, freq_truncate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought I had was that we could store the entire ALT field as another variable (of type string), which could be used for reconstructing more alleles without having to regenerate the whole dataset. This might be orthogonal to what we are discussing here though.
I also wondered if we could use Zarr ragged arrays, but I don't think you can combine them with variable length strings - and even if you could they don't play well with the general array data model, which assumes fixed array dimensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ragged arrays would probably be the best solution in the long run, but that's a big change, as you say. Maybe we should break that discussion off into an issue, so we can explore the pros and cons?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more, if we stored the ALT field as another variable (just a single comma-delimited string, not a ragged array), then I think there is more of a case for keeping all the allele values in the genotype field, since it would be possible to find the allele string if needed. (This would also allow the fixed length alleles
field to be regenerated if needed, without having to read all the data in from scratch.)
We could document the contract here (i.e. the possibility of indexing out of the alleles
array), and also a pattern of how to use xr.where
to ensure the genotypes don't index out of the alleles
array. This would effectively do the truncation when accessing the data.
I'm also a bit wary of marking some genotypes as missing with value -1, since then it's not possible to distinguish if they were missing in the original file, or because they were truncated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! That's a strong argument for truncating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed - clinches it in my view. We don't want two different behaviours when having different flavours of "allele out of bounds", do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now fixed in the latest commit.
This PR has conflicts, @tomwhite please rebase and push updated version 🙏 |
5f79076
to
7c17bac
Compare
Rebased, with no other changes for the time being. |
45b9f21
to
8b3caf7
Compare
This PR has conflicts, @tomwhite please rebase and push updated version 🙏 |
8b3caf7
to
45aafb4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Fixes #487. The maximum number of alleles found while parsing is stored in
ds.attrs["max_alt_alleles_seen"]
(in all cases).