Genomic coordinate systems #434

tomwhite · 2021-01-05T13:31:40Z

Are genomic positions in sgkit 0-based or 1-based? We should decide and document the coordinate system conventions that sgkit uses.

[Please correct any mistakes I've made below!]

When reading VCF, PLINK, and BGEN files we leave the position variables unchanged, so they use the convention of the underlying file format. These are as follows:

VCF is 1-based, fully closed http://samtools.github.io/hts-specs/VCFv4.3.pdf
PLINK (.bim) is 1-based https://www.cog-genomics.org/plink/2.0/formats#bim
BGEN (I can't find documentation saying which convention this uses)

In simulate_genotype_call_dataset we create positions starting at 0, which implies they are 0-based. This is inconsistent with the 1-based positions we get when reading VCF and PLINK files as described above.

For comparison here are the conventions that other libraries and tools use.

scikit-allel is 1-based. VCF POS is not changed, and functions are documented as being 1-based (example).
tskit is 0-based. See discussion here.
Glow is 0-based, at least for VCF. It converts VCF file positions to 0-based internally (see this section marked 'Important').
Hail is 1-based. See docs, and this warning.

Whichever convention we adopt, we will likely need to support multiple ways of specifying intervals for selection, or windowing, which may use either convention. (This article has a good summary of options.)

For example, chr20:1-100 (for selecting, see #25) is 1-based, fully closed (i.e. the end position is inclusive). On the other hand BED (which would be a useful option for specifying windows) uses 0-based, half-open intervals (i.e. the end position is exclusive).
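To make the difference between the two interval conventions concrete, here is a minimal sketch of converting a 1-based, fully closed region string into a BED-style 0-based, half-open interval and back. The names `HalfOpenInterval`, `region_to_half_open` and `half_open_to_region` are hypothetical helpers for illustration only; they are not part of sgkit.

```python
import re
from typing import NamedTuple


class HalfOpenInterval(NamedTuple):
    """BED-style interval (hypothetical helper type, not an sgkit API)."""

    contig: str
    start: int  # 0-based, inclusive
    end: int  # 0-based, exclusive


_REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def region_to_half_open(region: str) -> HalfOpenInterval:
    """Convert a 1-based, fully closed region such as 'chr20:1-100'."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"Cannot parse region: {region!r}")
    start = int(match["start"])
    end = int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"Invalid 1-based closed region: {region!r}")
    # Only the start shifts: the closed end (inclusive, 1-based) and the
    # open end (exclusive, 0-based) are the same number.
    return HalfOpenInterval(match["contig"], start - 1, end)


def half_open_to_region(interval: HalfOpenInterval) -> str:
    """Convert a 0-based, half-open interval back to a region string."""
    return f"{interval.contig}:{interval.start + 1}-{interval.end}"


bed_like = region_to_half_open("chr20:1-100")
print(bed_like)  # start=0, end=100: still 100 bases long
print(half_open_to_region(bed_like))
```

Note that the interval length is `end - start` in the half-open form and `end - start + 1` in the closed form, which is a common source of off-by-one errors when mixing the two.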


jeromekelleher · 2021-01-05T17:33:06Z

I had hoped that we could be agnostic and just reflect what's in the underlying data files - it's a horrible can of worms. There will always be some case where we do the wrong thing if we try to "correct" the input data, as it's a complete mess out there.

Can we get away with this, just saying "it's up to you to understand what coordinates your data uses"?

The simulation data we produce doesn't matter as it's only toy data for testing anyway.

jeromekelleher · 2021-01-05T17:34:09Z

Intervals are a different issue I think - we should be opinionated in this case and do half-open like BED.
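As a rough illustration of why half-open intervals are convenient for windowing, the sketch below selects sorted positions that fall in `[start, end)` using the standard library `bisect` module. The function `select_half_open` is a hypothetical name, and the sketch assumes the positions and the interval are already expressed in the same coordinate system.

```python
from bisect import bisect_left


def select_half_open(positions, start, end):
    """Return a slice covering sorted positions p with start <= p < end.

    Hypothetical helper for illustration; not an sgkit function.
    """
    lo = bisect_left(positions, start)  # first index with p >= start
    hi = bisect_left(positions, end)  # first index with p >= end (excluded)
    return slice(lo, hi)


positions = [0, 5, 10, 10, 15, 20]

# Adjacent half-open windows share a boundary value but never share a variant.
first = positions[select_half_open(positions, 0, 10)]
second = positions[select_half_open(positions, 10, 20)]
print(first)  # [0, 5]
print(second)  # [10, 10, 15]
```

With fully closed intervals, the windows 0-10 and 10-20 would both include the variants at position 10, so callers would have to adjust one of the boundaries by hand.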

jeromekelleher · 2021-01-08T16:46:24Z

I've thought a bit more about this, and I think we probably can't avoid taking a position on this. Fundamentally, a user who is using (standards-compliant) VCF and BED files together would expect us to interpret them correctly. So, we probably do need to choose either one-based or zero-based positions, and add some APIs for shifting input files that may differ by one (as @eric-czech suggested).
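One possible shape for such a shifting API is sketched below. It is only an assumption about what it could look like: `Base` and `convert_positions` are hypothetical names, not existing sgkit functions, and the sketch deliberately does no validation of the input positions.

```python
from enum import Enum


class Base(Enum):
    """Coordinate base of a position (hypothetical enum for illustration)."""

    ZERO = 0
    ONE = 1


def convert_positions(positions, source: Base, target: Base):
    """Shift positions from the source convention to the target convention."""
    offset = target.value - source.value  # one of -1, 0 or +1
    return [p + offset for p in positions]


vcf_positions = [1, 100, 2500]  # 1-based, as stored in a VCF POS column
zero_based = convert_positions(vcf_positions, Base.ONE, Base.ZERO)
print(zero_based)  # [0, 99, 2499]

# Converting back recovers the original values.
print(convert_positions(zero_based, Base.ZERO, Base.ONE))  # [1, 100, 2500]
```

Making the source and target conventions explicit arguments, rather than a bare `+1` or `-1`, keeps the direction of the shift visible at the call site.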

The question is, which one do we choose? We will be wrong in a large number of people's eyes whichever we choose (unless we split the difference and add/subtract 0.5 from all coordinates 😉 ).

tomwhite · 2021-11-11T12:19:45Z

Just wanted to note here that the GA4GH Variation Representation Specification uses inter-residue coordinates.

tomwhite · 2022-11-30T11:43:19Z

Tutorial:Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

Python CLI: https://github.com/griffithlab/convert_zero_one_based

tomwhite added the data representation label Jan 5, 2021

tomwhite mentioned this issue Jun 7, 2021

Rename window to window_by_index, and add window_by_position #581

Merged

tomwhite mentioned this issue Sep 16, 2021

Implement region queries #658

Closed

tomwhite mentioned this issue Oct 5, 2021

Mean of windowed popgen stats #662

Open

tomwhite mentioned this issue Dec 5, 2022

window_by_interval #974

Merged

tomwhite mentioned this issue Feb 5, 2024

Variant positions less than one? #1176

Open
