What metadata should be #2276
Comments
Thank you for the comment. So I understand you essentially want metadata to be a
Thanks for detailing this. I'm curious why you explicitly don't want vectors to carry metadata. That wasn't my priority either, but if we can have it for free thanks to how it's implemented, why reject it? Since DataFrames own their columns that doesn't prevent us from having

Likewise, why wouldn't plotting methods use the

Another point that I wonder about is what should happen when renaming columns. Wouldn't it make sense to preserve the metadata in that case, since the column contents haven't changed, so presumably the description, unit, etc. still apply? Isn't Stata's behavior just due to incomplete support for labels? BTW, if you have a reference describing how Stata propagates labels across operations, that would be interesting.

Probably in the trickiest cases @bkamins listed (4 and 5) we should just drop the metadata unless it's the same in the input columns being combined? When only one column has metadata, we could use that, but that could be a bit risky (e.g. you concatenate data frames with an
@bkamins yes that is how I envision metadata. Functionally, it's just a
I think we should have a rule where the left data frame dominates. But I think that we can maintain the gist of my goal while dropping metadata in some small edge cases.

@nalimilan My aversion to column-specific metadata is that I feel like metadata only really makes sense at the data frame level. Labels, for example, need to be interpreted in the context of a data frame. Take, for example, my technique in Stata of writing a "history" of operations in notes. If you see a vector with the
How should one interpret that with a vector on its own? Perhaps there are use cases for metadata that exists without a table, but that seems complicated and would result in a lot of extraneous, not-useful, information floating around.
It would be very nice, but then we would have to agree on a standard and there is a lot of hidden behavior. For instance, what if the user wants the

Unfortunately I can't find a full spec for Stata's behavior. However, I just played around with it and confirmed that
Given the way you describe this PR, it seems like
Strongly agree here, at least in the case of calling
This is complicated, but I agree that left should dominate. With things like
Given https://arrow.juliadata.org/dev/manual.html#Table-and-column-metadata the question is whether we should go the "easy" way and just support

The only thing that would need to be added is an extension of the Tables.jl API to allow metadata passing. @pdeffebach @nalimilan @quinnj: what do you think?
I haven't read all the previous discussions again, but requiring custom array types to add column metadata sounds problematic. I think we need a way to attach at least a long name or description to columns without changing their types. As we discussed at JuliaData/DataAPI.jl#22, this can be achieved either by storing the column metadata in the data frame itself, or by keeping a global table associating object IDs with their metadata (which could then be accessed separately from the data frame if needed). Either way, both scenarios can be supported for Arrow serialization and deserialization: we just need a protocol in DataAPI.jl or Tables.jl which allows extracting the metadata and passing it to Arrow. I also think we should anticipate allowing metadata other than strings in the future, even if we don't allow that immediately.
OK. So I am moving the discussion to JuliaData/DataAPI.jl#22 as it is more general.
Closed with #3055
This post outlines, briefly, how I would like metadata to work in DataFrames.

`df.income` is simply a `Vector` and there is no metadata attached to it in general. Conceptually, think of metadata as an extension of column names. If I pass `df.income` to a function, that function only knows it receives a `Vector` and does not know it has a name, `:income`. Metadata should work the same way.

`copy(df)` preserves metadata, as does `filter` etc. It is also persistent across `join`s. For instance, if `df1` and `df2` both have the column `:id`, then joining them will preserve metadata for all columns. The entry for `metadata(df, :id)` will be the same as `metadata(df1, :id)` because in a `leftjoin` the left data frame is thought of as the `master` data frame and the right one is the `using` data frame, in Stata-speak.

`label`. Rather, if someone wants to graph `df.income`, they should do

or similar.
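The join rule in this post (for a column present in both inputs, the left data frame's metadata wins) can be sketched without DataFrames at all. This is only an illustration under stated assumptions: `ColMeta` and `leftjoin_metadata` are hypothetical names, and column metadata is modeled as a plain `Dict` keyed by column name, which is not necessarily how #1458 stores it.

```julia
# Hypothetical model, not the DataFrames.jl implementation:
# column name => (metadata key => value).
const ColMeta = Dict{Symbol,Dict{Symbol,Any}}

# Combine metadata under the "left dominates" rule: every column keeps its
# metadata, and a column present in both inputs keeps the left table's entry.
function leftjoin_metadata(left::ColMeta, right::ColMeta)
    out = ColMeta()
    for (col, meta) in right
        out[col] = copy(meta)
    end
    for (col, meta) in left
        out[col] = copy(meta)   # overwrites any entry that came from `right`
    end
    return out
end

meta1 = ColMeta(:id => Dict{Symbol,Any}(:label => "Person ID"),
                :income => Dict{Symbol,Any}(:label => "Personal Income"))
meta2 = ColMeta(:id => Dict{Symbol,Any}(:label => "Respondent"),
                :age => Dict{Symbol,Any}(:label => "Age in years"))

joined = leftjoin_metadata(meta1, meta2)
```

Here `joined[:id]` carries the entry from `meta1`, while `:income` and `:age` each keep the metadata from the only table that defines them.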
My ideal API for this is implemented, at least partly, in #1458. It includes the functions

- `metadata!` for setting metadata, via `metadata!(df, :income, :label, "Personal Income")`
- `metadata` for getting metadata for an object, via `metadata(df, :income, :label) == "Personal Income"`.

Notice, again, that these are handled at the level of the data frame and are agnostic about what the columns are: you could change the vector corresponding to `df.income` and the metadata would be the same.

Here are @bkamins
True. Just like `df.col` has no name attached to it.

Yes. Metadata is attached to a `name` in a data frame.

Yes, `df.col2` has no metadata.

Exactly, `:col` is a name in the data frame. If the user wants to transfer the metadata from one column to another they can do

Or something along those lines. Presumably we can overload `getindex` for cleaner syntax. In Stata this would be

I used that kind of workflow a lot working with survey data.
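To make the "metadata is attached to a name, not to a vector" idea concrete, here is a minimal, self-contained sketch. It is not the code from #1458: `MetaTable` is a hypothetical stand-in for a data frame, and the `metadata!`/`metadata` methods below are toy versions that only mirror the call shapes quoted in this post.

```julia
# Hypothetical stand-in for a data frame: columns and their metadata are
# stored side by side, both keyed by column name.
struct MetaTable
    columns::Dict{Symbol,Vector}
    colmeta::Dict{Symbol,Dict{Symbol,Any}}
end

MetaTable(columns) = MetaTable(columns, Dict{Symbol,Dict{Symbol,Any}}())

# Set one metadata entry for a named column.
function metadata!(t::MetaTable, col::Symbol, key::Symbol, value)
    haskey(t.columns, col) || throw(ArgumentError("no column named $col"))
    get!(() -> Dict{Symbol,Any}(), t.colmeta, col)[key] = value
    return t
end

# Get all metadata for a column, or a single entry.
metadata(t::MetaTable, col::Symbol) = get(t.colmeta, col, Dict{Symbol,Any}())
metadata(t::MetaTable, col::Symbol, key::Symbol) = metadata(t, col)[key]

t = MetaTable(Dict{Symbol,Vector}(:income => [52_000, 61_500],
                                  :income2 => [0, 0]))
metadata!(t, :income, :label, "Personal Income")

# Replacing the vector leaves the metadata alone, because it is keyed by name.
t.columns[:income] = [70_000, 80_000]

# One way to transfer metadata from one column to another.
for (k, v) in metadata(t, :income)
    metadata!(t, :income2, k, v)
end
```

In this sketch the vector retrieved with `t.columns[:income]` is an ordinary `Vector{Int}` with nothing attached to it, which matches the behaviour argued for above.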