Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] improve uns DataFrame I/O #2874

Merged
merged 3 commits into from
Aug 14, 2024
Merged

[python] improve uns DataFrame I/O #2874

merged 3 commits into from
Aug 14, 2024

Conversation

ryan-williams
Copy link
Member

Issue and/or context: builds on #2872

Changes:

  • During from_anndata ingest, copy DataFrames supplied by user in uns before passing to _write_dataframe, to avoid mutating user-supplied objects
    • obs and var are already copied before being passed to _write_dataframe, by being passed through conversions.decategoricalize_obs_or_var (example)
  • During to_anndata outgest, use _read_dataframe on uns DataFrames, to better restore what was originally ingested
  • Clean up tests (added in [python] More AnnData equality assertions in tests #2872) which codify previous behavior

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@ryan-williams
Copy link
Member Author

My idea was to merge #2872 first, then this PR's target would become main, then merge this.

Base automatically changed from rw/eq to main August 13, 2024 13:29
Copy link

codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (7da6511) to head (a250847).
Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2874       +/-   ##
===========================================
+ Coverage   59.35%   90.03%   +30.68%     
===========================================
  Files          92       37       -55     
  Lines       11373     3925     -7448     
  Branches      730        0      -730     
===========================================
- Hits         6750     3534     -3216     
+ Misses       4451      391     -4060     
+ Partials      172        0      -172     
Flag Coverage Δ
libtiledbsoma ?
python 90.03% <85.71%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 90.03% <85.71%> (+0.20%) ⬆️
libtiledbsoma ∅ <ø> (∅)

logging.log_io_same(
f"Warning: uns {collection.uri}[{key!r}] has {_UNS_OUTGEST_HINT_KEY} as unrecognized {hint}: leaving this as Pandas DataFrame"
)
extracted[key] = _read_dataframe(element, fallback_index_name="index")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnkerl just want to confirm: this change means outgested uns DataFrames no longer "gain" a soma_joinid column. The idea is to outgest DataFrames that match what was ingested (using the same function/logic that is applied to obs/var, _read_dataframe).

Does that seem right, or do we want to continue outgesting uns DataFrames containing a soma_joinid column that was not present on ingest?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There have been asks in the past (maybe not well-tracked on my part) for:

  • soma_joinid to not be included on outgest of obs/var -- because:
    • This is a SOMA join ID not an AnnData join ID
    • Gets us closer to round-trip fidelity
  • soma_joinid to be included on outgest of obs/var -- because:
    • Customers wanting to see it, as they now consider it "part of the data" they're interested in

Some discussion on what to do about this -- again, not well-tracked on my part -- some in Slack/Shortcut/GitHub, I'll need to recover:

  • Default exclude soma_joinid on outgest of obs/var; opt-in via some TBD yet-another-kwarg, or, some TBD config parameter
  • Default include soma_joinid on outgest of obs/var; opt-in via some TBD yet-another-kwarg, or, some TBD config parameter
  • Maybe something else

This was never resolved AFAIK. I'll take it as an action item to track this.

Meanwhile let's be sure we're not changing the current behavior for obs/var -- the change we made to go from not outgesting soma_joinid (for non-uns), to outgesting soma_joinid (for non-uns) was per an explicit customer request. So we need to keep the current behavior for outgesting non-uns until we track this through product with @aaronwolen .

For outgesting uns dataframes (not obs or var) -- I believe you have the flexibility to do what you think is best, and to factor the code as you wish to (a) retain the current behavior for obs/var and (b) get what you think best for dataframes within uns.

TL;DR re your "Does that seem right" -- yes, but with the above just for complete clarity for all parties involved (not just you and me but everyone reading this).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change we made to go from not outgesting soma_joinid (for non-uns), to outgesting soma_joinid (for non-uns) was per an explicit customer request

I'm confused by this… non-uns (obs/var) do not currently outgest a soma_joinid.

obs_df.drop([SOMA_JOINID], …) was added in #667 (Jan '23 / 1.0.0), and remains present in 1.13.0.

This PR changes uns to drop soma_joinid on outgest, aligning uns with obs/var, and improving round-trip fidelity for uns. Existing obs/var behavior is preserved (i.e. soma_joinid is removed on outgest).

Copy link
Member

@johnkerl johnkerl Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryan-williams thanks, my understanding was out of date then.

Thanks for the history links!

@@ -1047,24 +1035,6 @@ def test_uns_io(tmp_path, outgest_uns_keys):

expected_adata = deepcopy(adata0)

# When outgesting `uns` DataFrames, `to_anndata` fails to remove the `soma_joinid` column added
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B.: this block corresponds to the _read_dataframe line above. If we outgest the same thing we ingest, we no longer need the test to simulate the soma_joinid column addition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this seems safe -- thank you.

I always get nervous changing existing unit-test cases so thank you again -- very much! -- for sequencing these PRs so it's clear to reviewer(s) including myself, and, anyone reading squash-merged PRs afterward (we have customers that read our GH like a newspaper :) ) , that we've made our lists and checked them twice, and that we are doing the right things regarding (a) important feature-improvement/refactor work, and (b) preserving tests which are our bulwark for maintaining customer-level expectations.

Again, I really appreciate the work you're doing here, and how you're sequencing it.

- `_extract_uns`: use `_read_dataframe` (restore index, rm `soma_joinid` on outgest)
- don't mutate users' `uns` DataFrames in `from_anndata`
@@ -1047,24 +1035,6 @@ def test_uns_io(tmp_path, outgest_uns_keys):

expected_adata = deepcopy(adata0)

# When outgesting `uns` DataFrames, `to_anndata` fails to remove the `soma_joinid` column added
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this seems safe -- thank you.

I always get nervous changing existing unit-test cases so thank you again -- very much! -- for sequencing these PRs so it's clear to reviewer(s) including myself, and, anyone reading squash-merged PRs afterward (we have customers that read our GH like a newspaper :) ) , that we've made our lists and checked them twice, and that we are doing the right things regarding (a) important feature-improvement/refactor work, and (b) preserving tests which are our bulwark for maintaining customer-level expectations.

Again, I really appreciate the work you're doing here, and how you're sequencing it.

@ryan-williams ryan-williams merged commit 1e64fdd into main Aug 14, 2024
11 checks passed
@ryan-williams ryan-williams deleted the rw/uns branch August 14, 2024 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants