Add "on"-parameter to "merge" method

I'd like to propose a change to the merge method.

I often run into cases where I'd like to merge subsets of the same dataset.
However, this currently requires renaming all dimensions, changing indices, and merging the subsets by hand.

As an example, please consider the following dataset:
```
Dimensions:          (genes: 8787, observations: 8166)
Coordinates:
  * observations     (observations) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-0005-SM-57WCN'
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
    individual       (observations) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    subtissue        (observations) object 'Adipose_Subcutaneous' ... 'Whole_Blood'
Data variables:
    cdf              (observations, genes) float32 0.18883839 ... 0.4876754
    l2fc             (observations, genes) float32 -0.21032093 ... -0.032540113
    padj             (observations, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
```
There is at most one observation for each combination of `subtissue` and `individual`.

Now, I'd like to plot all values in `subtissue == "Whole_Blood"` against `subtissue == "Adipose_Subcutaneous"`. Therefore, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate.

To simplify this task, I'd like to have the following abstraction:
```python3
# select tissues
tissue_1 = ds.sel(observations=(ds.subtissue == "Whole_Blood"))
tissue_2 = ds.sel(observations=(ds.subtissue == "Adipose_Subcutaneous"))

# inner join by individual
merged = tissue_1.merge(tissue_2, on="individual", newdim="merge_dim", join="inner")

print(merged)
```
The result should look like this:
```
Dimensions:          (genes: 8787, merge_dim: 286)
Coordinates:
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
  * merge_dim        (merge_dim) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    observations:1   (merge_dim) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-1826-SM-5GZYN'
    observations:2   (merge_dim) object 'GTEX-111CU-0005-SM-57WCN' ... 'GTEX-ZXG5-0005-SM-57WCN'
    subtissue:1      (merge_dim) object 'Whole_Blood' ... 'Whole_Blood'
    subtissue:2      (merge_dim) object 'Adipose_Subcutaneous' ... 'Adipose_Subcutaneous'
Data variables:
    cdf:1            (merge_dim, genes) float32 0.18883839 ... 0.4876754
    cdf:2            (merge_dim, genes) float32 ...
    l2fc:1           (merge_dim, genes) float32 -0.21032093 ... -0.032540113
    l2fc:2           (merge_dim, genes) float32 ...
    padj:1           (merge_dim, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
    padj:2           (merge_dim, genes) float32 ...
```
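For comparison, here is a rough sketch of the manual workaround on a tiny toy dataset, using only existing calls (`rename`, `swap_dims`, `xr.merge`). The helper `merge_on` and its extra `dim` argument are hypothetical names made up for illustration, not an existing or proposed xarray API, and the toy values are invented.

```python
import numpy as np
import xarray as xr


def merge_on(left, right, on, newdim, dim, join="inner"):
    """Hypothetical helper (not an xarray API): join two datasets on coordinate `on`."""
    renamed = []
    for number, ds in enumerate((left, right), start=1):
        # Turn the `on` coordinate into the index of the new dimension.
        ds = ds.rename({on: newdim}).swap_dims({dim: newdim})
        # Suffix every variable living on the new dimension so names stay unique.
        clashes = [
            name
            for name, var in ds.variables.items()
            if newdim in var.dims and name != newdim
        ]
        renamed.append(ds.rename({name: f"{name}:{number}" for name in clashes}))
    return xr.merge(renamed, join=join)


# Toy stand-in for the dataset above (values are made up).
ds = xr.Dataset(
    {"cdf": (("observations", "genes"), np.arange(8.0).reshape(4, 2))},
    coords={
        "observations": ["a-blood", "a-fat", "b-blood", "c-fat"],
        "genes": ["g1", "g2"],
        "individual": ("observations", ["a", "a", "b", "c"]),
        "subtissue": (
            "observations",
            ["Whole_Blood", "Adipose_Subcutaneous"] * 2,
        ),
    },
)

blood = ds.isel(observations=(ds.subtissue == "Whole_Blood").values)
fat = ds.isel(observations=(ds.subtissue == "Adipose_Subcutaneous").values)

merged = merge_on(blood, fat, on="individual", newdim="merge_dim", dim="observations")
print(merged)
```

This only works if `on` is unique within each subset, which is exactly the case where the proposed parameter would be a convenience rather than new functionality.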

---------------------------------------


To summarize, I'd propose the following changes:
- Add parameter `on: Union[str, List[str], Tuple[str, ...], Dict[str, str]]`
  This should specify one or more coordinates to merge on.
  - Simple merge: string
    => merge by `left[str]` and `right[str]`
  - Merge of multiple coords: list or tuple of strings
    => merge by `left[str1, str2, ...]` and `right[str1, str2, ...]`
  - To merge differently named coords: dict, e.g. `{"str_left": "str_right"}`
    => merge by `left[str_left]` and `right[str_right]`
- Add some parameter like `newdim` to name the newly created index dimension.
  If `on` specifies multiple coords, this new index dimension should be a multi-index of these coords.
- Rename all duplicate coordinates and data variables not specified in `on` to some unique name
  e.g. `left["cdf"] => merged["cdf:1"]` and `right["cdf"] => merged["cdf:2"]`
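The three accepted forms of `on` could be reduced to one internal representation before any merging happens. A minimal sketch, assuming a hypothetical helper `normalize_on` (the name and return shape are my own choice for illustration):

```python
from typing import Dict, List, Sequence, Tuple, Union

OnSpec = Union[str, Sequence[str], Dict[str, str]]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Hypothetical helper: turn `on` into a list of (left_name, right_name) pairs."""
    if isinstance(on, str):
        # Simple merge: the same coordinate name on both sides.
        return [(on, on)]
    if isinstance(on, dict):
        # Differently named coords: keys are left names, values are right names.
        return list(on.items())
    # List or tuple of strings: each name is used on both sides.
    return [(name, name) for name in on]


print(normalize_on("individual"))
print(normalize_on(["individual", "subtissue"]))
print(normalize_on({"individual": "donor"}))
```

The string check has to come first, since a `str` is itself a sequence of strings in Python.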

If the coordinates given in `on` do not uniquely identify each data point, the matching entries should be combined as a cross product. However, since this can lead to quadratic runtime and memory requirements, I am not sure how to handle it safely.
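One possible safeguard, sketched here in plain Python rather than against xarray internals, would be to count the resulting rows before materializing anything and refuse to continue above a limit. The function `inner_join_indices` and the `max_rows` parameter are hypothetical and only illustrate the idea:

```python
from collections import Counter, defaultdict


def inner_join_indices(left_keys, right_keys, max_rows=1_000_000):
    """Hypothetical helper: (left, right) position pairs of a cross-product inner join."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # The output size is known from the key counts alone, before building any pairs.
    n_rows = sum(n * right_counts[key] for key, n in left_counts.items())
    if n_rows > max_rows:
        raise ValueError(
            f"join would produce {n_rows} rows, more than max_rows={max_rows}"
        )

    right_positions = defaultdict(list)
    for j, key in enumerate(right_keys):
        right_positions[key].append(j)

    # Every left entry is paired with every right entry sharing its key.
    return [
        (i, j)
        for i, key in enumerate(left_keys)
        for j in right_positions.get(key, ())
    ]


pairs = inner_join_indices(["a", "a", "b"], ["a", "a", "c"])
print(pairs)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The resulting position pairs could then be used to index both sides, so the expensive step only runs once the size check has passed.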

What do you think about this addition?

Add "on"-parameter to "merge" method #3224
