Genomic coordinate systems #434
Comments
I had hoped that we could be agnostic and just reflect what's in the underlying data files - it's a horrible can of worms. There will always be some case where we do the wrong thing if we try to "correct" the input data, as it's a complete mess out there. Can we get away with this, just saying "it's up to you to understand what coordinates your data uses"? The simulation data we produce doesn't matter as it's only toy data for testing anyway.
Intervals are a different issue I think - we should be opinionated in this case and do half-open like BED.
I've thought a bit more about this, and I think we probably can't avoid taking a position on this. Fundamentally, a user who is using (standards-compliant) VCF and BED files together would expect us to interpret them correctly. So, we probably do need to choose either one-based or zero-based positions, and add some APIs for shifting input files that may differ by one (as @eric-czech suggested). The question is, which one do we choose? We will be wrong in a large number of people's eyes whichever we choose (unless we split the difference and add/subtract 0.5 from all coordinates 😉 ).
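To make the "shift by one" idea concrete, here is a minimal sketch of what such a helper could look like. The name `shift_positions` and its signature are hypothetical and are not an existing sgkit API; it only assumes that positions are held in an integer array.

```python
import numpy as np


def shift_positions(positions, offset):
    """Hypothetical helper (not part of sgkit): return positions shifted by offset.

    Use offset=-1 to go from 1-based to 0-based positions,
    and offset=+1 to go from 0-based to 1-based positions.
    """
    positions = np.asarray(positions)
    shifted = positions + offset
    # A negative result usually means the input was already 0-based.
    if shifted.size and shifted.min() < 0:
        raise ValueError("shift produced a negative position")
    return shifted


one_based = np.array([1, 10, 100])           # e.g. positions as stored in a VCF
zero_based = shift_positions(one_based, -1)  # the same sites, 0-based
```

The negative-position check is only a cheap sanity test: it catches a 0-based input being shifted down again, but it cannot tell which convention an arbitrary file actually uses.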
Just wanted to note here that the GA4GH Variation Representation Specification uses inter-residue coordinates.
Are genomic positions in sgkit 0-based or 1-based? We should decide and document the coordinate system conventions that sgkit uses.
[Please correct any mistakes I've made below!]
When reading VCF, PLINK, and BGEN files we leave the position variables unchanged, so they use the convention of the underlying file format. These are as follows:
In simulate_genotype_call_dataset we create positions starting at 0, which implies they are 0-based. So this is inconsistent with reading the file formats above.
For comparison, here are the conventions that other libraries and tools use.
Whichever convention we adopt, we will likely need to support multiple ways of specifying intervals for selection, or windowing, which may use either convention. (This article has a good summary of options.)
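For example, chr20:1-100 (for selecting, see #25) is 1-based, fully closed (i.e. the end position is inclusive). On the other hand BED (which would be a useful option for specifying windows) uses 0-based, half-open intervals (i.e. the end position is exclusive).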
(for selecting, see #25) is 1-based, fully closed (i.e. the end position is inclusive). On the other hand BED (which would be a useful option for specifying windows) uses 0-based, half-open intervals (i.e. the end position is exclusive).The text was updated successfully, but these errors were encountered: