-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] improve uns
DataFrame I/O
#2874
Conversation
010aa63
to
3276107
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
My idea was to merge #2872 first, then this PR's target would become |
c44da38
to
58fd462
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2874 +/- ##
===========================================
+ Coverage 59.35% 90.03% +30.68%
===========================================
Files 92 37 -55
Lines 11373 3925 -7448
Branches 730 0 -730
===========================================
- Hits 6750 3534 -3216
+ Misses 4451 391 -4060
+ Partials 172 0 -172
Flags with carried forward coverage won't be shown. Click here to find out more.
|
logging.log_io_same( | ||
f"Warning: uns {collection.uri}[{key!r}] has {_UNS_OUTGEST_HINT_KEY} as unrecognized {hint}: leaving this as Pandas DataFrame" | ||
) | ||
extracted[key] = _read_dataframe(element, fallback_index_name="index") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnkerl just want to confirm: this change means outgested uns
DataFrames no longer "gain" a soma_joinid
column. The idea is to outgest DataFrames that match what was ingested (using the same function/logic that is applied to obs
/var
, _read_dataframe
).
Does that seem right, or do we want to continue outgesting uns
DataFrames containing a soma_joinid
column that was not present on ingest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There have been asks in the past (maybe not well-tracked on my part) for:
soma_joinid
to not be included on outgest ofobs
/var
-- because:- This is a SOMA join ID not an AnnData join ID
- Gets us closer to round-trip fidelity
soma_joinid
to be included on outgest ofobs
/var
-- because:- Customers wanting to see it, as they now consider it "part of the data" they're interested in
Some discussion on what to do about this -- again, not well-tracked on my part -- some in Slack/Shortcut/GitHub, I'll need to recover:
- Default exclude
soma_joinid
on outgest ofobs
/var
; opt-in via some TBD yet-another-kwarg, or, some TBD config parameter - Default include
soma_joinid
on outgest ofobs
/var
; opt-in via some TBD yet-another-kwarg, or, some TBD config parameter - Maybe something else
This was never resolved AFAIK. I'll take it as an action item to track this.
Meanwhile let's be sure we're not changing the current behavior for obs
/var
-- the change we made to go from not outgesting soma_joinid
(for non-uns), to outgesting soma_joinid
(for non-uns) was per an explicit customer request. So we need to keep the current behavior for outgesting non-uns until we track this through product with @aaronwolen .
For outgesting uns
dataframes (not obs
or var
) -- I believe you have the flexibility to do what you think is best, and to factor the code as you wish to (a) retain the current behavior for obs
/var
and (b) get what you think best for dataframes within uns
.
TL;DR re your "Does that seem right" -- yes, but with the above just for complete clarity for all parties involved (not just you and me but everyone reading this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change we made to go from not outgesting
soma_joinid
(for non-uns), to outgestingsoma_joinid
(for non-uns) was per an explicit customer request
I'm confused by this… non-uns (obs
/var
) do not currently outgest a soma_joinid
.
obs_df.drop([SOMA_JOINID], …)
was added in #667 (Jan '23 / 1.0.0), and remains present in 1.13.0.
This PR changes uns
to drop soma_joinid
on outgest, aligning uns
with obs
/var
, and improving round-trip fidelity for uns
. Existing obs
/var
behavior is preserved (i.e. soma_joinid
is removed on outgest).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryan-williams thanks, my understanding was out of date then.
Thanks for the history links!
@@ -1047,24 +1035,6 @@ def test_uns_io(tmp_path, outgest_uns_keys): | |||
|
|||
expected_adata = deepcopy(adata0) | |||
|
|||
# When outgesting `uns` DataFrames, `to_anndata` fails to remove the `soma_joinid` column added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N.B.: this block corresponds to the _read_dataframe
line above. If we outgest the same thing we ingest, we no longer need the test to simulate the soma_joinid
column addition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this seems safe -- thank you.
I always get nervous changing existing unit-test cases so thank you again -- very much! -- for sequencing these PRs so it's clear to reviewer(s) including myself, and, anyone reading squash-merged PRs afterward (we have customers that read our GH like a newspaper :) ) , that we've made our lists and checked them twice, and that we are doing the right things regarding (a) important feature-improvement/refactor work, and (b) preserving tests which are our bulwark for maintaining customer-level expectations.
Again, I really appreciate the work you're doing here, and how you're sequencing it.
- `_extract_uns`: use `_read_dataframe` (restore index, rm `soma_joinid` on outgest) - don't mutate users' `uns` DataFrames in `from_anndata`
@@ -1047,24 +1035,6 @@ def test_uns_io(tmp_path, outgest_uns_keys): | |||
|
|||
expected_adata = deepcopy(adata0) | |||
|
|||
# When outgesting `uns` DataFrames, `to_anndata` fails to remove the `soma_joinid` column added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this seems safe -- thank you.
I always get nervous changing existing unit-test cases so thank you again -- very much! -- for sequencing these PRs so it's clear to reviewer(s) including myself, and, anyone reading squash-merged PRs afterward (we have customers that read our GH like a newspaper :) ) , that we've made our lists and checked them twice, and that we are doing the right things regarding (a) important feature-improvement/refactor work, and (b) preserving tests which are our bulwark for maintaining customer-level expectations.
Again, I really appreciate the work you're doing here, and how you're sequencing it.
Issue and/or context: builds on #2872
Changes:
from_anndata
ingest,copy
DataFrames supplied by user inuns
before passing to_write_dataframe
, to avoid mutating user-supplied objectsobs
andvar
are already copied before being passed to_write_dataframe
, by being passed throughconversions.decategoricalize_obs_or_var
(example)to_anndata
outgest, use_read_dataframe
onuns
DataFrames, to better restore what was originally ingested