Description
I'd like to propose a change to the merge method.
I often run into cases where I'd like to merge subsets of the same dataset.
However, this currently requires renaming all dimensions, changing indices, and merging the subsets by hand.
As an example, please consider the following dataset:
Dimensions: (genes: 8787, observations: 8166)
Coordinates:
* observations (observations) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-0005-SM-57WCN'
* genes (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
individual (observations) object 'GTEX-111CU' ... 'GTEX-ZXG5'
subtissue (observations) object 'Adipose_Subcutaneous' ... 'Whole_Blood'
Data variables:
cdf (observations, genes) float32 0.18883839 ... 0.4876754
l2fc (observations, genes) float32 -0.21032093 ... -0.032540113
padj (observations, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
There is at most one observation for each combination of subtissue and individual.
Now, I'd like to plot all values in subtissue == "Whole_Blood" against subtissue == "Adipose_Subcutaneous". Therefore, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate.
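For reference, the manual route looks roughly like the sketch below. It uses a small made-up dataset with the same layout as the one above, and the helper name index_by_individual is hypothetical. It also assumes that each individual appears at most once per subtissue, since the join relies on a unique index.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the dataset above (all values are made up).
# Individual "C" has no Whole_Blood sample, so an inner join must drop it.
ds = xr.Dataset(
    {"l2fc": (("observations", "genes"), np.arange(10, dtype="float32").reshape(5, 2))},
    coords={
        "observations": ["A-fat", "A-blood", "B-fat", "B-blood", "C-fat"],
        "genes": ["g1", "g2"],
        "individual": ("observations", ["A", "A", "B", "B", "C"]),
        "subtissue": (
            "observations",
            ["Adipose_Subcutaneous", "Whole_Blood"] * 2 + ["Adipose_Subcutaneous"],
        ),
    },
)


def index_by_individual(ds, tissue, suffix):
    """Select one tissue, make `individual` the index, and suffix every other name."""
    mask = (ds["subtissue"] == tissue).values
    sub = ds.isel(observations=mask)
    # `individual` becomes the dimension; `observations` stays as a plain coordinate.
    sub = sub.swap_dims({"observations": "individual"})
    names = list(sub.data_vars) + ["observations", "subtissue"]
    return sub.rename_vars({name: f"{name}:{suffix}" for name in names})


tissue_1 = index_by_individual(ds, "Whole_Blood", 1)
tissue_2 = index_by_individual(ds, "Adipose_Subcutaneous", 2)

# Inner join on the shared `individual` index.
merged = xr.merge([tissue_1, tissue_2], join="inner")
print(merged)
```

Every step here (selecting, swapping the dimension, renaming, merging) has to be repeated for each pair of subsets, which is the boilerplate I'd like to avoid.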
To simplify this task, I'd like to have the following abstraction:
# select tissues
tissue_1 = ds.sel(observations = (ds.subtissue == "Whole_Blood"))
tissue_2 = ds.sel(observations = (ds.subtissue == "Adipose_Subcutaneous"))
# inner join by individual
merged = tissue_1.merge(tissue_2, on="individual", newdim="merge_dim", join="inner")
print(merged)
The result should look like this:
Dimensions: (genes: 8787, merge_dim: 286)
Coordinates:
* genes (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
* merge_dim (merge_dim) object 'GTEX-111CU' ... 'GTEX-ZXG5'
observations:1 (merge_dim) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-1826-SM-5GZYN'
observations:2 (merge_dim) object 'GTEX-111CU-0005-SM-57WCN' ... 'GTEX-ZXG5-0005-SM-57WCN'
subtissue:1 (merge_dim) object 'Whole_Blood' ... 'Whole_Blood'
subtissue:2 (merge_dim) object 'Adipose_Subcutaneous' ... 'Adipose_Subcutaneous'
Data variables:
cdf:1 (merge_dim, genes) float32 0.18883839 ... 0.4876754
cdf:2 (merge_dim, genes) float32 ...
l2fc:1 (merge_dim, genes) float32 -0.21032093 ... -0.032540113
l2fc:2 (merge_dim, genes) float32 ...
padj:1 (merge_dim, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
padj:2 (merge_dim, genes) float32 ...
To summarize, I'd propose the following changes:
- Add a parameter on: Union[str, List[str], Tuple[str], Dict[str, str]]
  This should specify one or more coordinates to merge on.
  - Simple merge: string
    => merge by left[str] and right[str]
  - Merge on multiple coords: list or tuple of strings
    => merge by left[str1, str2, ...] and right[str1, str2, ...]
  - Merge on differently named coords: dict, e.g. {"str_left": "str_right"}
    => merge by left[str_left] and right[str_right]
- Add a parameter such as newdim to specify the newly created index dimension.
  If on specifies multiple coords, this new index dimension should be a multi-index of these coords.
- Rename all duplicate variables and coordinates not specified in on to some unique name,
  e.g. left["cdf"] => merged["cdf:1"] and right["cdf"] => merged["cdf:2"]
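To make the three accepted forms of on concrete, here is a minimal sketch of how they could be normalized into (left, right) name pairs before merging. The function name normalize_on and the OnSpec alias are hypothetical and not part of xarray.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical alias for the proposed parameter type.
OnSpec = Union[str, List[str], Tuple[str, ...], Dict[str, str]]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Return (left_name, right_name) pairs for every accepted form of `on`."""
    if isinstance(on, str):
        # Simple merge: the same coordinate name on both sides.
        return [(on, on)]
    if isinstance(on, dict):
        # Differently named coords: keys are left names, values are right names.
        return list(on.items())
    if isinstance(on, (list, tuple)):
        # Multiple coords with identical names on both sides.
        return [(name, name) for name in on]
    raise TypeError(f"unsupported type for 'on': {type(on).__name__}")


print(normalize_on("individual"))
print(normalize_on(["individual", "subtissue"]))
print(normalize_on({"str_left": "str_right"}))
```

With this normalization, the rest of the merge logic only has to deal with one representation, regardless of which form the caller used.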
If the coordinates given in on do not uniquely identify each data point, the matching entries should be combined as a cross product. However, since this could require quadratic runtime and memory, I am not sure how to handle it safely.
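One possible safeguard, sketched below under my own assumptions, would be to count the rows an inner join would produce before building it and to refuse if the count exceeds a limit. The names joined_size and check_join_size and the max_rows parameter are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def joined_size(left_keys: Iterable[Hashable], right_keys: Iterable[Hashable]) -> int:
    """Number of rows an inner join produces when duplicate keys form a cross product."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on the left) * (occurrences on the right).
    return sum(n * right_counts[key] for key, n in left_counts.items())


def check_join_size(left_keys, right_keys, max_rows: int) -> int:
    """Raise before merging if the joined result would exceed `max_rows`."""
    size = joined_size(left_keys, right_keys)
    if size > max_rows:
        raise ValueError(f"join would produce {size} rows, limit is {max_rows}")
    return size


# "A" appears twice on the left and three times on the right: 2 * 3 = 6 rows.
print(joined_size(["A", "A", "B"], ["A", "A", "A", "C"]))
```

The count itself only needs one pass over each key list, so the check stays cheap even when the join would not be.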
What do you think about this addition?