Description
The feature
It would be really nice to have a way of associating chromosome info with the dataframe containing the ranges. I would propose using pd.DataFrame.attrs for storing metadata such as chromosome info and column names.
Why
GRanges objects from Bioconductor have a @seqinfo attribute that contains sequence info about the assembly being used. For example:
library(EnsDb.Hsapiens.v86)
ensdb = EnsDb.Hsapiens.v86
g = genes(ensdb)
head(g, 3)
# GRanges object with 3 ranges and 6 metadata columns:
# seqnames ranges strand | gene_id gene_name gene_biotype seq_coord_system symbol entrezid
# <Rle> <IRanges> <Rle> | <character> <character> <character> <character> <character> <list>
# ENSG00000223972 1 11869-14409 + | ENSG00000223972 DDX11L1 transcribed_unproces.. chromosome DDX11L1 100287596,100287102,727856,...
# ENSG00000227232 1 14404-29570 - | ENSG00000227232 WASH7P unprocessed_pseudogene chromosome WASH7P <NA>
# ENSG00000278267 1 17369-17436 - | ENSG00000278267 MIR6859-1 miRNA chromosome MIR6859-1 102466751
# -------
# seqinfo: 357 sequences (1 circular) from GRCh38 genome
g@seqinfo
# Seqinfo object with 357 sequences (1 circular) from GRCh38 genome:
# seqnames seqlengths isCircular genome
# 1 248956422 FALSE GRCh38
# 10 133797422 FALSE GRCh38
# 11 135086622 FALSE GRCh38
# 12 133275309 FALSE GRCh38
# 13 114364328 FALSE GRCh38
# ... ... ... ...
# LRG_741 231167 FALSE GRCh38
# LRG_93 22459 FALSE GRCh38
# MT 16569 TRUE GRCh38
# X 156040895 FALSE GRCh38
# Y 57227415 FALSE GRCh38
It would be nice if we could also attach this kind of information to our range dataframe for use with bioframe. This could be done by putting something equivalent to @seqinfo into the pd.DataFrame.attrs attribute. Something similar could also be done for different range column names.
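As a rough sketch of what this could look like, the snippet below stores a Seqinfo-like dictionary in attrs and uses it to check ranges against chromosome lengths. The "seqinfo" key, its layout, and the check_within_bounds helper are assumptions made for illustration, not part of bioframe or pandas; the MT row uses made-up coordinates.

```python
import pandas as pd

# A small range dataframe using chrom/start/end columns.
# The first two rows reuse coordinates from the GRanges output above;
# the MT row is illustrative only.
df = pd.DataFrame(
    {
        "chrom": ["1", "1", "MT"],
        "start": [11869, 14404, 100],
        "end": [14409, 29570, 500],
    }
)

# Hypothetical metadata layout, loosely mirroring Seqinfo.
# The key name "seqinfo" and its structure are assumptions.
df.attrs["seqinfo"] = {
    "genome": "GRCh38",
    "seqlengths": {"1": 248956422, "MT": 16569},
    "is_circular": {"1": False, "MT": True},
}


def check_within_bounds(frame):
    """Return a boolean Series: True where a range ends within its chromosome."""
    lengths = frame["chrom"].map(frame.attrs["seqinfo"]["seqlengths"])
    return frame["end"] <= lengths


in_bounds = check_within_bounds(df)
```

A function written this way needs no separate chromsizes argument, since the lengths travel with the dataframe.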
Current use of global configuration
With cols, this library already provides ways of setting different values without needing to pass them all the time (docs). These rely on either a global config or a context manager that temporarily modifies that config.
I think both of these are less ergonomic than attaching the metadata to the data:
- They require explicit code for something which could be explicit in the data, but implicit in the code.
- They're global, and don't allow working with different configurations at the same time
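To illustrate the second point, here is a minimal sketch in which each dataframe declares its own range column names in attrs, so two differently shaped frames can be used side by side without touching global state. The "range_cols" key and both helper functions are hypothetical names invented for this example.

```python
import pandas as pd

DEFAULT_COLS = ("chrom", "start", "end")


def range_cols(frame):
    """Hypothetical helper: read range column names from attrs, else use defaults."""
    return tuple(frame.attrs.get("range_cols", DEFAULT_COLS))


def range_lengths(frame):
    """Compute interval lengths using whichever column names the frame declares."""
    _, start, end = range_cols(frame)
    return frame[end] - frame[start]


# Frame with the default column names; no metadata needed.
a = pd.DataFrame({"chrom": ["1"], "start": [100], "end": [250]})

# Frame with different column names, declared on the data itself.
b = pd.DataFrame({"seqnames": ["1"], "begin": [5], "stop": [25]})
b.attrs["range_cols"] = ("seqnames", "begin", "stop")

len_a = range_lengths(a)
len_b = range_lengths(b)
```

Both calls work in the same scope, with no context manager and no argument for the column names.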
Downsides
pd.DataFrame.attrs
The main downside is pd.DataFrame.attrs.
- It's still marked as experimental, and can change
- It doesn't show up in the repr, so it's not obvious if anything has been added
I would hope that usage here could influence further development of the features.
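The repr point is easy to see in a small example. The "genome" key below is only an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"chrom": ["1"], "start": [11869], "end": [14409]})
df.attrs["genome"] = "GRCh38"  # hypothetical key, for illustration only

# The default repr renders only the table, so the metadata is not visible.
text = repr(df)
hidden = "GRCh38" not in text

# The metadata has to be inspected explicitly.
meta = dict(df.attrs)
```

Any code relying on attrs would therefore need to print or validate the metadata itself.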
May not work with other backends
It's not immediately obvious whether alternative backends would also support this kind of feature
- (I proposed alternative backends in Alternative DataFrame class(es) for OOC + speed #137)
Alternatives
- Do nothing, keep passing this metadata as is.
- Custom class of some sort (like bioconductor)
- Instead of a custom dataframe class, this could be a pandas extension array, which would be a lighter touch.
- But this doesn't fit with the current bioframe design