Genomic coordinate systems #434

Open
tomwhite opened this issue Jan 5, 2021 · 5 comments
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc

Comments

@tomwhite
Collaborator

tomwhite commented Jan 5, 2021

Are genomic positions in sgkit 0-based or 1-based? We should decide and document the coordinate system conventions that sgkit uses.

[Please correct any mistakes I've made below!]

When reading VCF, PLINK, and BGEN files we leave the position variables unchanged, so they use the convention of the underlying file format. These are as follows:

In simulate_genotype_call_dataset we create positions starting at 0, which implies they are 0-based. So this is inconsistent with reading the file formats above.

For comparison here are the conventions that other libraries and tools use.

Whichever convention we adopt, we will likely need to support multiple ways of specifying intervals for selection, or windowing, which may use either convention. (This article has a good summary of options.)

For example, chr20:1-100 (for selecting, see #25) is 1-based, fully closed (i.e. the end position is inclusive). On the other hand BED (which would be a useful option for specifying windows) uses 0-based, half-open intervals (i.e. the end position is exclusive).
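To make the difference between the two interval conventions concrete, here is a minimal sketch of converting a 1-based, fully closed region string into a 0-based, half-open interval and back. The names `Interval`, `parse_region`, and `to_region` are hypothetical and are not part of sgkit.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    """A 0-based, half-open interval (BED-style). Hypothetical helper type."""
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


_REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> Interval:
    """Parse a 1-based, fully closed 'contig:start-end' string."""
    m = _REGION_RE.match(region)
    if m is None:
        raise ValueError(f"not a contig:start-end region: {region!r}")
    start, end = int(m["start"]), int(m["end"])
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based closed region: {region!r}")
    # Only the start shifts: the closed end at 1-based position N is the
    # same boundary as the exclusive end N in 0-based coordinates.
    return Interval(m["contig"], start - 1, end)


def to_region(interval: Interval) -> str:
    """Format a 0-based, half-open interval as a 1-based, closed region."""
    return f"{interval.contig}:{interval.start + 1}-{interval.end}"


bed_style = parse_region("chr20:1-100")
print(bed_style)             # Interval(contig='chr20', start=0, end=100)
print(to_region(bed_style))  # chr20:1-100
```

Both forms describe the same 100 bases; in the half-open form the length is simply `end - start`.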

tomwhite added the "data representation" label on Jan 5, 2021
@jeromekelleher
Collaborator

I had hoped that we could be agnostic and just reflect what's in the underlying data files - it's a horrible can of worms. There will always be some case where we do the wrong thing if we try to "correct" the input data, as it's a complete mess out there.

Can we get away with this, just saying "it's up to you to understand what coordinates your data uses"?

The simulation data we produce doesn't matter as it's only toy data for testing anyway.

@jeromekelleher
Collaborator

Intervals are a different issue I think - we should be opinionated in this case and do half-open like BED.
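One practical argument for half-open intervals is that adjacent windows tile a contig with no overlaps and no gaps. The sketch below illustrates this with a hypothetical `select_window` helper (not an sgkit function), assuming the positions are sorted and expressed in the same coordinate system as the window bounds.

```python
from bisect import bisect_left


def select_window(positions, start, end):
    """Return a slice covering sorted positions p with start <= p < end."""
    lo = bisect_left(positions, start)
    hi = bisect_left(positions, end)  # end is exclusive
    return slice(lo, hi)


positions = [5, 10, 10, 20, 35]
windows = [(0, 10), (10, 20), (20, 40)]

# Each position lands in exactly one window, because the end of one
# window is the start of the next and only the start is inclusive.
selected = [positions[select_window(positions, s, e)] for s, e in windows]
print(selected)  # [[5], [10, 10], [20, 35]]
```

With fully closed windows written the same way, the positions at 10 and 20 would each be counted twice.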

@jeromekelleher
Collaborator

jeromekelleher commented Jan 8, 2021

I've thought a bit more about this, and I think we probably can't avoid taking a position on this. Fundamentally, a user who is using (standards-compliant) VCF and BED files together would expect us to interpret them correctly. So, we probably do need to choose either one-based or zero-based positions, and add some APIs for shifting input files that may differ by one (as @eric-czech suggested).

The question is, which one do we choose? We will be wrong in a large number of people's eyes whichever we choose (unless we split the difference and add/subtract 0.5 from all coordinates 😉).
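As a rough idea of what a shifting API might look like, here is a minimal sketch. The function `shift_positions` and its parameters are hypothetical; no such function is implied to exist in sgkit, and a real version would presumably operate on the dataset's position variable rather than on a plain list.

```python
def shift_positions(positions, from_base, to_base):
    """Convert point positions between 0-based and 1-based numbering."""
    if from_base not in (0, 1) or to_base not in (0, 1):
        raise ValueError("bases must be 0 or 1")
    offset = to_base - from_base
    shifted = [p + offset for p in positions]
    # A position below the target base means the input was not really
    # in the coordinate system the caller claimed.
    if any(p < to_base for p in shifted):
        raise ValueError("position out of range for the target base")
    return shifted


vcf_positions = [1, 100, 2500]                    # 1-based, as in VCF
zero_based = shift_positions(vcf_positions, 1, 0)
print(zero_based)                                 # [0, 99, 2499]
```

Making the caller state both the source and the target convention keeps the direction of the shift explicit, which is where off-by-one mistakes usually creep in.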

@tomwhite
Collaborator Author

Just wanted to note here that the GA4GH Variation Representation Specification uses inter-residue coordinates.
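For readers unfamiliar with the term, inter-residue coordinates number the boundaries between residues rather than the residues themselves, so an interval can have zero length and an insertion point is represented naturally. The sketch below is a general illustration using Python string slicing, which follows the same arithmetic; it does not reproduce any data structure from the GA4GH specification, and `apply_edit` is a hypothetical helper.

```python
def apply_edit(sequence, start, end, replacement):
    """Replace the residues between boundaries start and end."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("interval outside sequence")
    return sequence[:start] + replacement + sequence[end:]


ref = "ACGT"  # boundaries are numbered 0..4: |A|C|G|T|

print(apply_edit(ref, 1, 2, "T"))   # substitution of C:      ATGT
print(apply_edit(ref, 2, 2, "TT"))  # insertion at boundary 2: ACTTGT
print(apply_edit(ref, 1, 3, ""))    # deletion of CG:          AT
```

Numerically these intervals coincide with 0-based, half-open intervals; the difference is in what the numbers are taken to mean.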

