DOC/API: document how to use metadata #8572
Comments
I think this is a good idea, but we'll need to do it carefully. I'm guessing the When I implemented this in xray, I didn't want to deal with these issues, so I took the more conservative approach of dropping custom metadata (other than name) in all binary arithmetic and aggregations. See http://xray.readthedocs.org/en/stable/faq.html#what-is-your-approach-to-metadata I think this sort of hook system is a good idea to let someone else deal with the complexity. RE: your specific design. I would rather make users define a custom subclass, e.g., |
@shoyer the design for this was for users to subclass / monkey-patch This is just a dispatch to user interaction, like: hey I have these 2 objects which I am adding, how do you want to combine the meta-data. It will drop by default, this just allows a 'plug-in' type of mechanism. |
Hi, Really glad to see the subclassing and finalize behavior come to fruition. I'm not sure I fully understand the scope of this general metadata problem, but in my experience, the relation of metadata to the results of From studying GeoPandas as well as pyuvvis, I get the impression that most pandas-subclassing libraries are going to use a relatively limited set of operations from the API, so maybe it's a good idea to make them work out all of the use cases for metadata as they go, and just provide context and suggestions in the docs? Sorry if this is not germane to the issue at hand ;0 |
@hugadams this modification / doc update is for the USER to really do all of the work. pandas provides a framework, if you will, but the USER decides ALL interactions with metadata (otherwise they are NOT propagated). So it is basically wide open for the USER to provide a mechanism to propagate some / raise an error if needed etc. I guess some docs are in order! |
Sounds great |
Resurrecting this with a bit of an alternate / synthesis of previous ideas. The basic idea is to push metadata propagation onto the subclasses, as was previously suggested. The new proposal is for pandas to provide a bit more infrastructure for subclasses, which would remove the need for any global state. The

    import pandas as pd

    class SubclassedDataFrame2(pd.DataFrame):
        # normal properties
        _metadata = ['color']

        def __init__(self, *args, color=None, **kwargs):
            self.color = color
            super().__init__(*args, **kwargs)

        @property
        def _constructor(self):
            return SubclassedDataFrame2

        def __add__(self, other):
            if self.color != other.color:
                raise ValueError
            # super() already binds self, so only `other` is passed
            return super().__add__(other)

    >>> a = SubclassedDataFrame2({"A": [1, 2], "B": [3, 4]}, color='red')

For things like

    >>> a[['A']].color
    red

But binary operations don't propagate the metadata

    >>> (a + a).color  # None

We could patch A potential solution is for pandas to provide a

    class Metadata:
        def __init__(self, name):
            self.name = name

        def __repr__(self):
            return "Metadata({})".format(self.name)

        def __add__(self, left, right):
            return None  # do not propagate

and subclasses would override the methods they want

    class ColorMetadata(Metadata):
        def __add__(self, left, right):
            # set() takes a single iterable, so build the set with a literal
            if {left.color, right.color} == {"blue", "yellow"}:
                return 'green'
            elif {left.color, right.color} == {"blue", "red"}:
                return "purple"
            ...

        def concat(self, left, right):
            return '-'.join([left.color, right.color])

So when defining a subclass, it would be

    >>> b = SubclassedDataFrame2({"A": [1, 2], "C": [5, 6]}, color='blue')
    >>> print((a + b).color)
    purple

thoughts? |
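To make the proposal above concrete without depending on any pandas internals, here is a minimal pure-Python sketch of the same idea. Everything in it is hypothetical: `Tagged` is a toy stand-in for a DataFrame subclass, the rule methods are named `add` and `concat` rather than being dunder methods, and the choice to keep the color when both operands share it is my assumption, not something stated in the thread.

```python
class Metadata:
    """Base rule object: by default every operation drops the metadata."""

    def __init__(self, name):
        self.name = name

    def add(self, left, right):
        return None  # do not propagate

    def concat(self, left, right):
        return None  # do not propagate


class ColorMetadata(Metadata):
    # Illustrative mixing table only; keys are unordered pairs of colors.
    _mix = {
        frozenset({"blue", "yellow"}): "green",
        frozenset({"blue", "red"}): "purple",
    }

    def add(self, left, right):
        pair = frozenset({left.color, right.color})
        if len(pair) == 1:
            # Assumption: identical colors are simply carried over.
            return left.color
        return self._mix.get(pair)  # None when no rule is defined

    def concat(self, left, right):
        return "-".join([left.color, right.color])


class Tagged:
    """Hypothetical container that consults its rule objects on `+`."""

    _rules = [ColorMetadata("color")]

    def __init__(self, values, color=None):
        self.values = list(values)
        self.color = color

    def __add__(self, other):
        result = Tagged([x + y for x, y in zip(self.values, other.values)])
        # Let each rule decide what the result's metadata should be.
        for rule in self._rules:
            setattr(result, rule.name, rule.add(self, other))
        return result


a = Tagged([1, 2], color="red")
b = Tagged([5, 6], color="blue")
print((a + b).color)  # purple
print((a + a).color)  # red
```

The point of the design is that the container never hard-codes how `color` combines; swapping in a different rule object changes the behavior without touching `__add__`.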
This is a solid approach. Have you thought about how to handle Series-level metadata when that Series becomes a column in a DataFrame? e.g.
Here |
An issue that comes up with the column-specific metadata is that |
This has moved up a bit on my priority list. I'm hoping to use I'm playing with different APIs right now. The core feature I want to provide is for a given attribute to determine how metadata should be propagated for a given pandas method. I think that necessitates some kind of finalizer like

    dispatch: Dict[Tuple[method, metadata_name], Callable] = {}

    def __finalize__(self, other, method):
        for metadata_name in self._metadata:
            dispatch[(method, metadata_name)](self, other)

There are a few options for registering finalizers with the dispatch, but right now I'm favoring something like

    duplicate_meta = PandasMetadata("disallow_duplicate_labels")  # the metadata name

    @duplicate_meta.register(pd.concat)
    def finalize_concat(new, other):
        new.allow_duplicate_labels = all(x.allow_duplicate_labels for x in other)

And we would provide a default finalizer that does what we do on master today (copy from The main problems I'm facing now.
I have a work in progress like https://github.com/TomAugspurger/pandas/pull/new/metadata-dispatch. |
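As a rough illustration of the dispatch idea described in this comment, the sketch below implements the registry in plain Python. It is not the code from the linked branch: `PandasMetadata` here is a hypothetical re-creation, methods are keyed by name strings rather than by functions such as `pd.concat`, `Obj` is a toy object, and the fallback finalizer (copy the attribute from the first input) is an assumption on my part.

```python
from typing import Callable, Dict, Tuple

# Maps (method name, metadata name) -> finalizer callable.
dispatch: Dict[Tuple[str, str], Callable] = {}


class PandasMetadata:
    """Hypothetical handle for one named piece of metadata."""

    def __init__(self, name):
        self.name = name

    def register(self, method):
        # Decorator that records a finalizer for (method, this metadata).
        def decorator(func):
            dispatch[(method, self.name)] = func
            return func
        return decorator


def default_finalizer_for(name):
    # Assumed default: copy the attribute from the first input object.
    def finalizer(new, others):
        setattr(new, name, getattr(others[0], name, None))
    return finalizer


def finalize(new, others, method, metadata_names):
    """Run the registered (or default) finalizer for each metadata name."""
    for name in metadata_names:
        finalizer = dispatch.get((method, name)) or default_finalizer_for(name)
        finalizer(new, others)
    return new


class Obj:
    """Toy stand-in for a Series/DataFrame carrying one flag."""

    def __init__(self, allow_duplicate_labels=True):
        self.allow_duplicate_labels = allow_duplicate_labels


duplicate_meta = PandasMetadata("allow_duplicate_labels")


@duplicate_meta.register("concat")
def finalize_concat(new, others):
    # The result allows duplicates only if every input does.
    new.allow_duplicate_labels = all(x.allow_duplicate_labels for x in others)


names = ["allow_duplicate_labels"]
result = finalize(Obj(), [Obj(True), Obj(False)], "concat", names)
print(result.allow_duplicate_labels)  # False
```

Because unregistered `(method, name)` pairs fall back to the default, only the methods with special semantics need an explicit finalizer.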
AFAICT there isn't a way to comment on that branch until a PR is opened. Can you open it as a "draft" or something? In this discussion it seems like disallow_duplicate_labels is pinned to a Series/DataFrame, but it was originally discussed as an Index attribute. Is that distinction important? |
Not quite ready yet, still figuring out the design. Will have a WIP soonish.
For disallow duplicates, my original idea was on Index, but I think NDFrame is the way to go.
|
from SO
xref #2485
xref #7868
This last will require a bit of change in __finalize__ to handle a metadata finalizer for a specific name (but straightforward)