Continue adding Metadata to dataframes #1458

pdeffebach · 2018-07-18T16:03:00Z

This PR picks up where #1429 left off, before I got messed up and deleted the branch.

Here are some outstanding issues with the current code:

Allocation for merge!

If df_left' has a :label field and df_right does not, then merge! creates a metadata for the unique columns on df_right and adds a :label field populated with nothings. This is undesirable because

It allocates a new Dict, and a new metadata in general
It adds a bunch of nothings to an array.

My intuition is that (1) is bad but (2) is consistent with what merge! does for the columns of dataframes. So I should change the behavior to just append nothings to all the fields in the metadata for df_left. This is particularly important when df_right is empty. But the concept still applies for if df_left and df_right have different fields.

Getindex and setindex

I like the idea of having metadata(df, :x1, :field) = "my information assign metadata and just metadata(df, :x1, :field retrieve metadata. This means overriding getindex and setindex but only allowing them to be accessed via metadata()?

Mis-typing

I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.

Future work

joining dataframes calls the new constructor, so to make joins work we might have to touch some of the constructor code to allow metadata.

We will also have to profile this extensively to make sure we aren't losing performance for people who don't wish to use metadata.

asdf

Describe to dataframe

Rearranged get_stats, fixed docstring, re-did data constructor, and other small fixes.

pdeffebach · 2018-08-16T20:15:31Z

Resurrecting this a little bit.

What is the API of metadata?
I was thinking about this recently and I thought of the an API centered around one function, metadata that functions like getindex and setindex

metadata(df, :x, :label) = "A label here"
metadata(df, :x, :label) 
> "A label here"

This works very nicely with broadcasting.

metadata.(Ref(df), names(df), :label) 
> an array of labels

There are only a few helper functions I can think of:
What if we want to add a new field to our metadata? Rather than silently create a new field with a call to metadata(df, :x, :newfield) we should have helper functions that

Allow the user to allocate a new field to metadata
- This is because the cost of a new field might be somewhat high for large dataframes.
Allow the user to delete a field in metadata
Return the fields currently allowed in the metadata.
- This could probably be done with keys since the metadata is ultimately a Dict. Or we could write a new function for this.

With the third of these bullets, you could use broadcasting again to do

metadata(Ref(df), :x,  getmetadatafields(df))
> Vector{Any} of various fields for variable :x

That is 4 functions added to the namespace of DataFrames.

pdeffebach · 2018-08-18T15:40:01Z

I'm gonna put together a larger document about how metadata is handled in various systems to try and work out an API.

here is an issue on pandas which is eerily similar to discussions on discourse about this.

nalimilan

Sorry for the long delay. Below are a few reactions to your points.

I like the idea of having metadata(df, :x1, :field) = "my information" assign metadata and just metadata(df, :x1, :field) retrieve metadata. This means overriding getindex and setindex but only allowing them to be accessed via metadata()?

No, that's not possible. The closest syntax which could work with setindex! is metadata(df)[:x1, :field] = "my information" or metadata(df, :field)[:x1] = "my information", assuming metadata would return an object with a custom type. Else, we have to do metadata!(df, :x1, :field, "my information").

I guess the main advantage of having metadata(df) return a special object is that we could use keys(metdata(df)) to get the list of keys. Then metadata(df)[:somekey] would return a vector or a named tuple, metadata(df)[:, :somecol] would return a named tuple of keys.

Here's a comparison of the syntaxes for common operations:

	`setindex!`/`getindex`	function
Get key for all columns	`metadata(df)[:somekey]`	`metadata(df, :somekey)`
Get key for one column	`metadata(df)[:somekey, :somecol]`	`metadata(df, :somekey, :somecol)`
Get all keys for one column	`metadata(df)[:, :somecol]`	`metadata(df, :, :somecol)`
Set key for one column	`metadata(df)[:somekey, :somecol] = "x"`	`metadata!(df, :somekey, :somecol, "x")`
Get list of keys	`keys(metadata(df))`	`metadatakeys(df)`

More generally, returning a special object would allow supporting more operations in the future without exporting new functions. For example, rename!(metadata(df), :key1 => :key2) could be used to rename a key.

Mis-typing

I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.

Well, that's what happens if you mistype df[:mycol] = 1 already. To prevent that, we would have to require people to call addfield! manually, which would be quite strict. Also that would make the API more complex.

joining dataframes calls the new constructor, so to make joins work we might have to touch some of the constructor code to allow metadata.
We will also have to profile this extensively to make sure we aren't losing performance for people who don't wish to use metadata.

Honestly I'm not too concerned about that, as the amount of work to do is really small compared with the join itself. However, in general it would be a good idea to check whether the meta-data is empty, and take a fast path in that case to eliminate the overhead.

nalimilan · 2018-08-31T13:37:34Z

src/other/metadata.jl

+    notonright = setdiff(keys(leftmeta.dict), keys(rightmeta.dict))
+
+    for field in notonleft
+        newfield!(leftmeta, length(leftindex), field, nothing)


As you noted in the issue description, this approach is not very efficient, and it doesn't work for rightmeta since it shouldn't be modified. Another way of doing this is in for key in keys(leftmeta.dict), to check whether the key exists in the left and right data frames. If it exists in both, call vcat as you currently do. If it exists only in one of the data frames, allocate a Vector{Union{Nothing, eltype(key_vec)}} and call copyto! to fill the corresponding entries.

bkamins · 2018-08-31T15:07:53Z

I have not been in this discussion earlier, and as I have mentioned to @pdeffebach ealier it would be great to have a write-up specifying what we want to achieve here.

Looking at the code I understand that we want to be able to attach arbitrary metadata to columns of a DataFrame (and not to a DataFrame as a whole). Right?

I can see that the design is to have a dictionary mapping keys to vectors of values for all columns of a DataFrame. So my first instinct is that we will have to be careful with maintenance and testing because there are many ways you can modify set of columns in a DataFrame and it is easy to loose integrity between metadata and the rest of the DataFrame structure. I can test this when some working prototype is passing CI and some docstrings allow me to be sure about the assumptions.

nalimilan · 2018-08-31T15:46:47Z

Yes, that's basically the goal. See #1413 and https://discourse.julialang.org/t/how-to-add-metadata-info-to-a-dataframe/11168.

bkamins · 2018-08-31T15:58:26Z

Thanks! Seeing the volume of the discussion I would recommend to document how it should work (even something like minimal dev docs would be good as we are getting to the point that the design of the whole package might become so complex that we are not sure when we change it one place what other parts might break).

pdeffebach · 2018-08-31T16:23:07Z

@bkamins thanks for the feedback. I have a plan to write a Stata script and some R package and do a battery of transforms and merges and see how they behave, but I haven't had time for it recently. However as far as I can tell there aren't that many systems out there that do this like Stata. So there will likely still be some behaviors unsettled.

I will try and get a non-joining API working based on Milan's comments. Then as you suggested a markdown document will be very helpful. The whole set of changes is simple for now, or at least isolated to a few files (a lot of the conversation is based on my lack of git understanding), but will get far more complicated once joins and constructors get involved. Hopefully I can work on this next week.

bkamins · 2018-08-31T16:40:12Z

Great. So just a few issues that came to my mind and should be decided upon (maybe they are):

do we handle makeunique;
do we handle column in insertion/deletion;
do we handle hcat and vcat.
do we handle getting metadata using column name and column number;
what do we do with groupby and stack/unstack.

(of course maybe most of those in the first round can be ignored and an appropriate mechanisms added later - the key is to be sure that we do not loose sync between metadata and data)

Also (referring to the discussion above) - if someone wants to assign metadata to the column that does not exist I think an error should be thrown.

nalimilan · 2018-08-31T16:46:38Z

I don't think these issues are hard, really. It's fine to drop meta-data in some operations like joins or grouping for now. We just have to handle column insertion and deletion, or the feature would really be confusing (as we would have to drop all meta-data). makeunique is completely orthogonal AFAICT: the column names are not even stored in the meta-data, it's just based on the ordering of columns. So we can automatically support retrieving columns by name or index.

bkamins · 2018-08-31T16:52:05Z

Agreed they do not have to be hard 😄, I just learned the hard way that sometimes simple things are not so simple as they seem.

But I agree that dropping metadata, at least for now, is OK. I would say that most of the time when you have metadata attached to a DataFrame you do not mutate it that much but rather use it as a reference.

pdeffebach · 2019-05-08T16:22:07Z

I would like to continue working on this now that the vcat has been merged and #1620 is about to be merged.

Are we still on board with this general approach? Do either of you trust me to rebase this myself?

bkamins · 2019-05-08T18:01:28Z

Rebasing src/DataFrames.jl and src/abstractdataframe/join.jl should be easy.
As for src/dataframe/dataframe.jl given the amount of conflicts I would consider (but choose what you prefer :)) to actually use the file from master and update it with the changes you need - this might be simpler and less error prone (especially for constructors).

From my side I would greatly appreciate if you added a section in docs/src/lib/types.md on medatata laying out the design assumptions - this would help me review the implementation. Also please remember to update docs/src/lib/functions.md.

Also adding some info to docs/src/man section would be welcome (but is not crucial for the review).

Thank you!

asinghvi17 · 2019-07-13T19:12:27Z

Bump @pdeffebach

bkamins · 2019-07-13T19:47:30Z

@pdeffebach If you plan to be at JuliaCon2019 this functionality is perfect to work on and discuss during the conference. We could move things forward fast there.

pdeffebach · 2019-07-16T18:03:40Z

I will unfortunately not be at JuliaCon. I have been at home for the summer before grad school so I haven't been thinking much about Julia.

I will try rebasing today at the very least so this PR isn't dead. After that I will work on documentation to write a contract for all this behavior.

bkamins · 2019-07-16T19:27:21Z

Thank you. A "contract" is exactly what we need. As I have said earlier - having thing like https://juliadata.github.io/DataFrames.jl/stable/lib/indexing.html is a must, in my opinion, for a complex change like this because:

we are sure that what we design is consistent
we can use test driven development (you can write test cases against a contract, even before implementing stuff)
it is easier, when reading the actual code, to check if it does what it should

Adding metadata will be a fantastic enchantment and I know that many people would love to have it.

bkamins · 2022-01-31T10:48:11Z

@pdeffebach - probably we should close this and keep the discussion on what we do in #2961. OK?

pdeffebach · 2022-01-31T16:37:55Z

Sounds good!

pdeffebach and others added 30 commits April 6, 2018 20:05

Make describe return a DataFrame

ec32311

asdf

Delete REQUIRE

3ac8af1

Add files via upload

c0227e9

fix rowvectors

a6fa29e

Update abstractdataframe.jl

d9621b4

Update abstractdataframe.jl

fc6ef8c

Update abstractdataframe.jl

c477684

Get rid of describe tests

215233f

delete tests

21936e1

Merge remote-tracking branch 'origin/master'

ab120a4

Add improved described

9f80ddf

Improve time for describe

bc3f2fa

Fix colstats vector in kw

e29de8b

Fix kw again

935d2f8

Improve kwargs closure

6561d89

Fix NUnique error

f9b5e77

Add docstring

01d9f86

edit docstring

01d836d

Merge pull request #1 from pdeffebach/describe_to_dataframe

9f69461

Describe to dataframe

finall fix noNunique error

c891344

more fixes

5ef7a16

fuck it more changes

d9db833

Respond to nalimilan's comments

4c54110

fix stats

ebfa988

fix stats again

fb3ca98

fix test

5d410e6

fix REQUIRE

e18be15

Respond to nalimilan's comments 2

fe02fbe

Rearranged get_stats, fixed docstring, re-did data constructor, and other small fixes.

fix indentation on test

707fea7

added bad stuff

d163c5e

pdeffebach added 2 commits July 14, 2018 17:20

just metadat changes 2

962afa0

Merge remote-tracking branch 'origin/metadata' into metadata

e160375

pdeffebach changed the title ~~Continue addint Metadata to dataframes~~ Continue adding Metadata to dataframes Jul 18, 2018

nalimilan reviewed Aug 31, 2018

View reviewed changes

bkamins mentioned this pull request Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

pdeffebach mentioned this pull request Apr 3, 2019

move DataFrame type definition to lightweight package #1764

Open

bkamins mentioned this pull request Jul 24, 2019

Add metadata support to DataFrames #1413

Closed

bkamins mentioned this pull request Feb 12, 2020

Retire this package? invenia/KeyedFrames.jl#19

Open

bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020

bkamins added this to the 2.0 milestone Feb 12, 2020

nalimilan mentioned this pull request May 23, 2020

metadata method JuliaData/DataAPI.jl#22

Closed

pdeffebach mentioned this pull request Jun 2, 2020

What metadata should be #2276

Closed

pdeffebach closed this Jan 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Continue adding Metadata to dataframes #1458

Continue adding Metadata to dataframes #1458

pdeffebach commented Jul 18, 2018

pdeffebach commented Aug 16, 2018 •

edited by nalimilan

Loading

pdeffebach commented Aug 18, 2018

nalimilan left a comment

nalimilan Aug 31, 2018

bkamins commented Aug 31, 2018

nalimilan commented Aug 31, 2018

bkamins commented Aug 31, 2018

pdeffebach commented Aug 31, 2018

bkamins commented Aug 31, 2018

nalimilan commented Aug 31, 2018

bkamins commented Aug 31, 2018

pdeffebach commented May 8, 2019

bkamins commented May 8, 2019

asinghvi17 commented Jul 13, 2019

bkamins commented Jul 13, 2019

pdeffebach commented Jul 16, 2019

bkamins commented Jul 16, 2019

bkamins commented Jan 31, 2022

pdeffebach commented Jan 31, 2022

Continue adding Metadata to dataframes #1458

Continue adding Metadata to dataframes #1458

Conversation

pdeffebach commented Jul 18, 2018

Allocation for merge!

Getindex and setindex

Mis-typing

Future work

pdeffebach commented Aug 16, 2018 • edited by nalimilan Loading

pdeffebach commented Aug 18, 2018

nalimilan left a comment

Choose a reason for hiding this comment

nalimilan Aug 31, 2018

Choose a reason for hiding this comment

bkamins commented Aug 31, 2018

nalimilan commented Aug 31, 2018

bkamins commented Aug 31, 2018

pdeffebach commented Aug 31, 2018

bkamins commented Aug 31, 2018

nalimilan commented Aug 31, 2018

bkamins commented Aug 31, 2018

pdeffebach commented May 8, 2019

bkamins commented May 8, 2019

asinghvi17 commented Jul 13, 2019

bkamins commented Jul 13, 2019

pdeffebach commented Jul 16, 2019

bkamins commented Jul 16, 2019

bkamins commented Jan 31, 2022

pdeffebach commented Jan 31, 2022

pdeffebach commented Aug 16, 2018 •

edited by nalimilan

Loading