Add "on"-parameter to "merge" method #3224

Closed
@Hoeze

I'd like to propose a change to the merge method.

Often, I run into cases where I'd like to merge subsets of the same dataset.
However, this currently requires renaming all dimensions, changing indices, and merging the subsets by hand.

As an example, please consider the following dataset:

Dimensions:          (genes: 8787, observations: 8166)
Coordinates:
  * observations     (observations) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-0005-SM-57WCN'
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
    individual       (observations) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    subtissue        (observations) object 'Adipose_Subcutaneous' ... 'Whole_Blood'
Data variables:
    cdf              (observations, genes) float32 0.18883839 ... 0.4876754
    l2fc             (observations, genes) float32 -0.21032093 ... -0.032540113
    padj             (observations, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0

There is at most one observation for each combination of subtissue and individual.

Now, I'd like to plot all values in subtissue == "Whole_Blood" against subtissue == "Adipose_Subcutaneous". Therefore, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate.

To simplify this task, I'd like to have the following abstraction:

# select tissues
tissue_1 = ds.sel(observations=(ds.subtissue == "Whole_Blood"))
tissue_2 = ds.sel(observations=(ds.subtissue == "Adipose_Subcutaneous"))

# inner join by individual
merged = tissue_1.merge(tissue_2, on="individual", newdim="merge_dim", join="inner")

print(merged)

The result should look like this:

Dimensions:          (genes: 8787, merge_dim: 286)
Coordinates:
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
  * merge_dim        (merge_dim) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    observations:1   (merge_dim) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-1826-SM-5GZYN'
    observations:2   (merge_dim) object 'GTEX-111CU-0005-SM-57WCN' ... 'GTEX-ZXG5-0005-SM-57WCN'
    subtissue:1      (merge_dim) object 'Whole_Blood' ... 'Whole_Blood'
    subtissue:2      (merge_dim) object 'Adipose_Subcutaneous' ... 'Adipose_Subcutaneous'
Data variables:
    cdf:1            (merge_dim, genes) float32 0.18883839 ... 0.4876754
    cdf:2            (merge_dim, genes) float32 ...
    l2fc:1           (merge_dim, genes) float32 -0.21032093 ... -0.032540113
    l2fc:2           (merge_dim, genes) float32 ...
    padj:1           (merge_dim, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
    padj:2           (merge_dim, genes) float32 ...
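
For comparison, here is a rough sketch of how a similar result might be put together by hand with existing xarray calls (swap_dims, rename, xr.merge). The toy dataset and the helpers select_tissue and index_by are made up for illustration and are not part of xarray. The sketch assumes each individual occurs at most once per subset, and it keeps "individual" as the name of the joined dimension instead of introducing merge_dim.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the dataset above; all labels and values are made up.
ds = xr.Dataset(
    {"cdf": (("observations", "genes"), np.arange(8, dtype="float32").reshape(4, 2))},
    coords={
        "observations": ["A-blood", "A-fat", "B-blood", "C-fat"],
        "genes": ["g1", "g2"],
        "individual": ("observations", ["A", "A", "B", "C"]),
        "subtissue": (
            "observations",
            ["Whole_Blood", "Adipose_Subcutaneous", "Whole_Blood", "Adipose_Subcutaneous"],
        ),
    },
)


def select_tissue(data, tissue):
    """Positionally select the observations that belong to one subtissue."""
    mask = data["subtissue"].values == tissue
    return data.isel(observations=np.flatnonzero(mask))


def index_by(subset, on, suffix, shared=("genes",)):
    """Make `on` the dimension coordinate and suffix all other names.

    Hypothetical helper, not part of xarray. Assumes `on` is unique in `subset`.
    """
    old_dim = subset[on].dims[0]
    swapped = subset.swap_dims({old_dim: on})
    keep = {on, *shared}
    renames = {name: f"{name}:{suffix}" for name in swapped.variables if name not in keep}
    return swapped.rename(renames)


left = index_by(select_tissue(ds, "Whole_Blood"), "individual", 1)
right = index_by(select_tissue(ds, "Adipose_Subcutaneous"), "individual", 2)

# Inner join on the shared "individual" and "genes" indexes.
merged = xr.merge([left, right], join="inner")
print(merged)
```

In the toy data only individual "A" has both tissues, so the merged dataset keeps a single entry along "individual". This is the boilerplate that an on parameter would hide.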

To summarize, I'd propose the following changes:

  • Add parameter on: Union[str, List[str], Tuple[str], Dict[str, str]]
    This should specify one or more coordinates to merge on.
    • Simple merge: string
      => merge by left[str] and right[str]
    • Merge of multiple coords: list or tuple of strings
      => merge by left[str1, str2, ...] and right[str1, str2, ...]
    • To merge differently named coords: dict, e.g. {"str_left": "str_right"}
      => merge by left[str_left] and right[str_right]
  • Add some parameter like newdim to specify the newly created index dimension.
    If on specifies multiple coords, this new index dimension should be a multi-index of these coords.
  • Rename all duplicate coordinates not specified in on to some unique name
    e.g. left["cdf"] => merged["cdf:1"] and right["cdf"] => merged["cdf:2"]
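
The three accepted forms of on could be reduced to one internal representation, a list of (left, right) name pairs. The following is only a sketch of that idea; normalize_on and the OnSpec alias are hypothetical names and not existing xarray API.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Accepted forms of the proposed `on` argument (hypothetical alias).
OnSpec = Union[str, Sequence[str], Dict[str, str]]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Return (left_name, right_name) pairs for any accepted form of `on`."""
    if isinstance(on, str):
        # Simple merge: the same coordinate name on both sides.
        return [(on, on)]
    if isinstance(on, dict):
        # Differently named coordinates: left name -> right name.
        return list(on.items())
    if isinstance(on, (list, tuple)):
        # Several coordinates with the same names on both sides.
        return [(name, name) for name in on]
    raise TypeError(f"unsupported type for 'on': {type(on).__name__}")


print(normalize_on("individual"))
print(normalize_on(["individual", "subtissue"]))
print(normalize_on({"str_left": "str_right"}))
```

The string check comes first because a str is itself a sequence and would otherwise be split into single characters.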

If the coordinates given in on do not uniquely identify each data point, the matching entries should be combined as a cross product. However, since this can require quadratic runtime and memory, I am not sure how to handle it safely.
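
One possible safeguard, sketched here under my own assumptions, would be to count the rows an inner join would produce before materializing it, and to refuse the merge above a limit. The names inner_join_size, check_join_size and max_rows are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def inner_join_size(left_keys: Iterable[Hashable], right_keys: Iterable[Hashable]) -> int:
    """Number of rows an inner join on these keys would produce.

    Every key that occurs n times on the left and m times on the right
    contributes n * m rows (the cross product of its matches).
    """
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    shared = left_counts.keys() & right_counts.keys()
    return sum(left_counts[key] * right_counts[key] for key in shared)


def check_join_size(left_keys, right_keys, max_rows: int) -> int:
    """Raise before merging if the result would exceed `max_rows` rows."""
    size = inner_join_size(left_keys, right_keys)
    if size > max_rows:
        raise ValueError(f"join would produce {size} rows, limit is {max_rows}")
    return size


# "A" occurs twice on the left and three times on the right: 2 * 3 rows.
print(inner_join_size(["A", "A", "B"], ["A", "A", "A", "C"]))
```

Counting costs only linear time in the number of keys, so the check itself stays cheap even when the join would not be.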

What do you think about this addition?
