[python] improve `uns` DataFrame I/O #2874

ryan-williams · 2024-08-09T23:13:19Z

Issue and/or context: builds on #2872

Changes:

During from_anndata ingest, copy DataFrames supplied by user in uns before passing to _write_dataframe, to avoid mutating user-supplied objects
- obs and var are already copied before being passed to _write_dataframe, by being passed through conversions.decategoricalize_obs_or_var (example)
During to_anndata outgest, use _read_dataframe on uns DataFrames, to better restore what was originally ingested
Clean up tests (added in [python] More AnnData equality assertions in tests #2872) which codify previous behavior

johnkerl

🚢

ryan-williams · 2024-08-12T19:35:40Z

My idea was to merge #2872 first, then this PR's target would become main, then merge this.

codecov · 2024-08-13T14:30:57Z

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (7da6511) to head (a250847).
Report is 3 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2874       +/-   ##
===========================================
+ Coverage   59.35%   90.03%   +30.68%     
===========================================
  Files          92       37       -55     
  Lines       11373     3925     -7448     
  Branches      730        0      -730     
===========================================
- Hits         6750     3534     -3216     
+ Misses       4451      391     -4060     
+ Partials      172        0      -172

Flag	Coverage Δ
libtiledbsoma	`?`
python	`90.03% <85.71%> (+0.20%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
python_api	`90.03% <85.71%> (+0.20%)`	⬆️
libtiledbsoma	`∅ <ø> (∅)`

ryan-williams · 2024-08-14T11:57:12Z

apis/python/src/tiledbsoma/io/outgest.py

+                    logging.log_io_same(
+                        f"Warning: uns {collection.uri}[{key!r}] has {_UNS_OUTGEST_HINT_KEY} as unrecognized {hint}: leaving this as Pandas DataFrame"
+                    )
+                extracted[key] = _read_dataframe(element, fallback_index_name="index")


@johnkerl just want to confirm: this change means outgested uns DataFrames no longer "gain" a soma_joinid column. The idea is to outgest DataFrames that match what was ingested (using the same function/logic that is applied to obs/var, _read_dataframe).

Does that seem right, or do we want to continue outgesting uns DataFrames containing a soma_joinid column that was not present on ingest?

There have been asks in the past (maybe not well-tracked on my part) for:

soma_joinid to not be included on outgest of obs/var -- because:

This is a SOMA join ID not an AnnData join ID

Gets us closer to round-trip fidelity

soma_joinid to be included on outgest of obs/var -- because:

Customers wanting to see it, as they now consider it "part of the data" they're interested in

Some discussion on what to do about this -- again, not well-tracked on my part -- some in Slack/Shortcut/GitHub, I'll need to recover:

Default exclude soma_joinid on outgest of obs/var; opt-in via some TBD yet-another-kwarg, or, some TBD config parameter

Default include soma_joinid on outgest of obs/var; opt-in via some TBD yet-another-kwarg, or, some TBD config parameter

Maybe something else

This was never resolved AFAIK. I'll take it as an action item to track this.

Meanwhile let's be sure we're not changing the current behavior for obs/var -- the change we made to go from not outgesting soma_joinid (for non-uns), to outgesting soma_joinid (for non-uns) was per an explicit customer request. So we need to keep the current behavior for outgesting non-uns until we track this through product with @aaronwolen .

For outgesting uns dataframes (not obs or var) -- I believe you have the flexibility to do what you think is best, and to factor the code as you wish to (a) retain the current behavior for obs/var and (b) get what you think best for dataframes within uns.

TL;DR re your "Does that seem right" -- yes, but with the above just for complete clarity for all parties involved (not just you and me but everyone reading this).

the change we made to go from not outgesting soma_joinid (for non-uns), to outgesting soma_joinid (for non-uns) was per an explicit customer request

I'm confused by this… non-uns (obs/var) do not currently outgest a soma_joinid.

obs_df.drop([SOMA_JOINID], …) was added in #667 (Jan '23 / 1.0.0), and remains present in 1.13.0.

This PR changes uns to drop soma_joinid on outgest, aligning uns with obs/var, and improving round-trip fidelity for uns. Existing obs/var behavior is preserved (i.e. soma_joinid is removed on outgest).

@ryan-williams thanks, my understanding was out of date then.

Thanks for the history links!

ryan-williams · 2024-08-14T11:58:56Z

apis/python/tests/test_basic_anndata_io.py

@@ -1047,24 +1035,6 @@ def test_uns_io(tmp_path, outgest_uns_keys):

    expected_adata = deepcopy(adata0)

-    # When outgesting `uns` DataFrames, `to_anndata` fails to remove the `soma_joinid` column added


N.B.: this block corresponds to the _read_dataframe line above. If we outgest the same thing we ingest, we no longer need the test to simulate the soma_joinid column addition.

Yes this seems safe -- thank you.

I always get nervous changing existing unit-test cases so thank you again -- very much! -- for sequencing these PRs so it's clear to reviewer(s) including myself, and, anyone reading squash-merged PRs afterward (we have customers that read our GH like a newspaper :) ) , that we've made our lists and checked them twice, and that we are doing the right things regarding (a) important feature-improvement/refactor work, and (b) preserving tests which are our bulwark for maintaining customer-level expectations.

Again, I really appreciate the work you're doing here, and how you're sequencing it.

- `_extract_uns`: use `_read_dataframe` (restore index, rm `soma_joinid` on outgest) - don't mutate users' `uns` DataFrames in `from_anndata`

johnkerl · 2024-08-14T14:21:09Z

apis/python/tests/test_basic_anndata_io.py

@@ -1047,24 +1035,6 @@ def test_uns_io(tmp_path, outgest_uns_keys):

    expected_adata = deepcopy(adata0)

-    # When outgesting `uns` DataFrames, `to_anndata` fails to remove the `soma_joinid` column added


Yes this seems safe -- thank you.

I always get nervous changing existing unit-test cases so thank you again -- very much! -- for sequencing these PRs so it's clear to reviewer(s) including myself, and, anyone reading squash-merged PRs afterward (we have customers that read our GH like a newspaper :) ) , that we've made our lists and checked them twice, and that we are doing the right things regarding (a) important feature-improvement/refactor work, and (b) preserving tests which are our bulwark for maintaining customer-level expectations.

Again, I really appreciate the work you're doing here, and how you're sequencing it.

This was referenced Aug 9, 2024

[python] More AnnData equality assertions in tests #2872

Merged

[python] update_uns_by_uri, _update_uns #2876

Merged

ryan-williams force-pushed the rw/eq branch 2 times, most recently from 010aa63 to 3276107 Compare August 12, 2024 14:49

ryan-williams force-pushed the rw/uns branch from b564f73 to 0ee668e Compare August 12, 2024 18:04

ryan-williams marked this pull request as ready for review August 12, 2024 18:23

ryan-williams requested a review from johnkerl August 12, 2024 18:24

johnkerl approved these changes Aug 12, 2024

View reviewed changes

Base automatically changed from rw/eq to main August 13, 2024 13:29

ryan-williams force-pushed the rw/uns branch 2 times, most recently from c44da38 to 58fd462 Compare August 13, 2024 14:18

ryan-williams commented Aug 14, 2024

View reviewed changes

ryan-williams requested a review from johnkerl August 14, 2024 12:02

ryan-williams added 3 commits August 14, 2024 08:14

don't mutate users' uns DataFrames in from_anndata

dd46eb5

_extract_uns: use _read_dataframe

b982d07

uns i/o improvements

a250847

- `_extract_uns`: use `_read_dataframe` (restore index, rm `soma_joinid` on outgest) - don't mutate users' `uns` DataFrames in `from_anndata`

ryan-williams force-pushed the rw/uns branch from 58fd462 to a250847 Compare August 14, 2024 12:15

ryan-williams mentioned this pull request Aug 14, 2024

[python] Docstring typo, test-assertion msgs, maybe_raises improvement, verify_logs helper #2892

Merged

johnkerl approved these changes Aug 14, 2024

View reviewed changes

ryan-williams merged commit 1e64fdd into main Aug 14, 2024
11 checks passed

ryan-williams deleted the rw/uns branch August 14, 2024 21:17

ryan-williams mentioned this pull request Aug 15, 2024

[Bug] pd.DataFrames modified during ingest/outgest round-trips #2829

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python] improve `uns` DataFrame I/O #2874

[python] improve `uns` DataFrame I/O #2874

ryan-williams commented Aug 9, 2024

johnkerl left a comment

ryan-williams commented Aug 12, 2024

codecov bot commented Aug 13, 2024 •

edited

Loading

ryan-williams Aug 14, 2024

johnkerl Aug 14, 2024

ryan-williams Aug 14, 2024

johnkerl Aug 14, 2024 •

edited

Loading

ryan-williams Aug 14, 2024

johnkerl Aug 14, 2024

johnkerl Aug 14, 2024

		@@ -1047,24 +1035,6 @@ def test_uns_io(tmp_path, outgest_uns_keys):

		expected_adata = deepcopy(adata0)

		# When outgesting `uns` DataFrames, `to_anndata` fails to remove the `soma_joinid` column added

[python] improve uns DataFrame I/O #2874

[python] improve uns DataFrame I/O #2874

Conversation

ryan-williams commented Aug 9, 2024

johnkerl left a comment

Choose a reason for hiding this comment

ryan-williams commented Aug 12, 2024

codecov bot commented Aug 13, 2024 • edited Loading

Codecov Report

ryan-williams Aug 14, 2024

Choose a reason for hiding this comment

johnkerl Aug 14, 2024

Choose a reason for hiding this comment

ryan-williams Aug 14, 2024

Choose a reason for hiding this comment

johnkerl Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

ryan-williams Aug 14, 2024

Choose a reason for hiding this comment

johnkerl Aug 14, 2024

Choose a reason for hiding this comment

johnkerl Aug 14, 2024

Choose a reason for hiding this comment

[python] improve `uns` DataFrame I/O #2874

[python] improve `uns` DataFrame I/O #2874

codecov bot commented Aug 13, 2024 •

edited

Loading

johnkerl Aug 14, 2024 •

edited

Loading