Add metadata support to DataFrames #1413

gcalderone · 2018-05-26T18:34:38Z

This PR adds metadata support to DataFrames. The idea for this PR comes from this discussion

Metadata are internally stored as a Dict{Any,Any}, one for each column and one for the whole table.

Metadata access is performed through the following methods:

metaget(df::DataFrame, key; default=nothing): returns the metadata entry with key key from the table dictionary. If the key is not present the value of the default keyword will be returned;
metaget(df::DataFrame, column::Symbol, key; default=nothing): returns the metadata entry with key key from the column column dictionary. If the key is not present the value of the default keyword will be returned;
metaset!(df::DataFrame, key, value): set an entry in the table metadata dictionary with key key and value value;
metaset!(df::DataFrame, column::Symbol, key, value): set an entry in the column column metadata dictionary with key key and value value;
metadict(df::DataFrame): return the table dictionary;
metadict(df::DataFrame, column::Symbol): return the column column dictionary.

Example:

using DataFrames
df = DataFrame(:col1=>1, :col2=>[1,2])
showcols(df)

# Request a non-present key in the table metadata dictionary
println("Table source: ",  metaget(df, :source, default="Unknown"))

# Set an entry in the table metadata dictionary and read it back
metaset!(df, :source, "www.some.site")
println("Table source: ",  metaget(df, :source))

# Set an entry using a `String` as key
metaset!(df, "query", "The query used to retrieve the data...")

# Display the table metadata dictionary
display(metadict(df))

# Request non-present keys in the column metadata dictionary
println("Column descr.: ",  metaget(df, :col1, :descr, default="Unspecified"))
println("Column unit  : ",  metaget(df, :col1, :unit,  default="Unspecified"))

# Set entries in the column dictionary and read them back
metaset!(df, :col1, :descr, "First column")
metaset!(df, :col1, :unit , "km / s")
println("Column descr.: ",  metaget(df, :col1, :descr))
println("Column unit  : ",  metaget(df, :col1, :unit ))

# `showcols` now searches for the `:descr` and `:unit` entries in column
# dictionaries.  If these are available, and the values can be
# converted to a `String`, they are also printed
showcols(df)

# Display the column metadata dictionary
display(metadict(df, :col1))

# Explore the column metadata dictionary
for (key, val) in metadict(df, :col1)
    println("$key = $val")
end

This PR adds no package dependencies and is backward compatible. All new methods have their own docstring, and a new module has been added to test the new facility (in test/meta.jl)

kescobo · 2018-05-26T18:52:54Z

This is awesome. 🎉

I don't love the names metaset and metaget. ¯_(ツ)_/¯

gcalderone · 2018-05-26T21:44:19Z

Well, I hope the method names will be the only thing to change ... ;-)
Do getmeta, setmeta! and dictmeta sound better?

nalimilan · 2018-05-27T10:00:13Z

Thanks for taking the initiative!

Here are a few general remarks:

I agree metaget and metaset don't look very Julian. I'd suggest meta and setmeta!/meta! (Document preferred naming convention for getters/setters in style guide JuliaLang/julia#16770).
I'm not sure we should expose the implementation to users: meta and setmeta! should be enough, no need for metadict. We can always add it later if it's really useful, but better start with a minimal API.
I'd also rather restrict the type of the keys to Symbol (or String?) for now, as I don't think we have a strong use case for other types. In particular this will limit inconsistencies, with some packages using symbols and other strings.
Regarding the implementation, I think it would be more efficient to have only two dicts, one for the global meta-data and one for column-specific meta-data. The second one would store vectors of values in the same order as columns. Then you don't need to modify the Index type (which is just a mechanism to look up the index of columns from their names). Columns for which a key isn't available can have a nothing entry, and valid values can be wrapped inside Some to allow the user to explicitly store nothing (as Some(nothing)). That way we don't have to create a Dict object (which is relatively expensive) for each column, and it will be faster to retrieve all column-specific values for a given key.
You'll have to adapt all setindex! and getindex methods to handle the column meta-data. Tests don't cover this currently.
It would be useful to have a look at whether/how other software does this. AFAICT neither dplyr, data.table nor Pandas (Allow custom metadata to be attached to panel/df/series? pandas-dev/pandas#2485) support meta-data, but maybe there are other apps than Stata (which is quite restrictive)? In particular it would make sense to identify the most common kinds of meta-data, and document standard key names for them.
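For illustration, the storage layout suggested in the implementation remark above (one vector per key, `nothing` for unset entries, `Some` around stored values) could look roughly like this. It is a minimal sketch in plain Julia, not code from the PR; `ColMeta`, `setcolmeta!` and `getcolmeta` are hypothetical names.

```julia
# Hypothetical container: none of these names exist in DataFrames.
struct ColMeta
    ncols::Int
    values::Dict{Symbol, Vector{Any}}   # key => one slot per column
end

ColMeta(ncols::Integer) = ColMeta(ncols, Dict{Symbol, Vector{Any}}())

# Store a value; wrapping it in Some lets the user store `nothing` explicitly.
function setcolmeta!(m::ColMeta, col::Integer, key::Symbol, value)
    slots = get!(m.values, key) do
        Vector{Any}(nothing, m.ncols)   # unset slots hold a bare `nothing`
    end
    slots[col] = Some(value)
    return m
end

# Return the stored value, or `default` when the slot is unset.
function getcolmeta(m::ColMeta, col::Integer, key::Symbol; default=nothing)
    haskey(m.values, key) || return default
    slot = m.values[key][col]
    return slot === nothing ? default : something(slot)
end

m = ColMeta(3)
setcolmeta!(m, 1, :unit, "km / s")
setcolmeta!(m, 2, :unit, nothing)            # explicitly stored `nothing`
getcolmeta(m, 1, :unit)                      # "km / s"
getcolmeta(m, 3, :unit, default="unset")     # "unset"
```

With this layout, fetching every column's value for one key is a single dictionary lookup, which is the efficiency argument made above.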

gcalderone · 2018-05-27T11:32:13Z

I'd suggest meta and setmeta!/meta!

Done, new method names are: meta, setmeta! and metakeys (to retrieve metadata keys);

no need for metadict

metadict is still present but no longer exported;

I'd also rather restrict the type of the keys to Symbol (or String?)

I agree we have no strong use cases to use Any for keys, still I think that both Symbol and String may be useful.

For instance, we may suggest users to adopt Symbol keys for quantities supposed to be read/interpreted by other programs, and String keys for quantities to be displayed (e.g. plot labels).

I think it would be more efficient to have only two dicts, one
for the global meta-data and one for column-specific
meta-data. The second one would store vectors of values in the
same order as columns.

I'm not sure I understood. Are you proposing:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Vector{Any}} # column specific

?

This approach is not flexible enough since you would not have column specific keys (i.e. :col1, :unit). Rather I would prefer:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Dict{Symbol,Any}} # column specific

but also this approach is not convenient since colmeta should be updated each time the colindex is updated. Hence, I think it is better to modify the Index type (as I did).

Or maybe I'm missing something?

You'll have to adapt all setindex! and getindex methods

Sorry, I don't understand why. Data and metadata live in two separate objects, and setindex! and getindex only operate on data.

Moreover, when a column is added/deleted the corresponding Dict is added/deleted accordingly.

Could you please elaborate on this ?

gcalderone · 2018-05-27T11:36:23Z

Update of the first comment:
This PR adds metadata support to DataFrames. The idea for this PR comes from this discussion

Metadata are internally stored as a Dict{Union{Symbol,String},Any}, one for each column and one for the whole table.

Metadata access is performed through the following methods:

meta(df::DataFrame, key::Union{Symbol,String}; default=nothing): returns the metadata entry with key key from the table dictionary. If the key is not present the value of the default keyword will be returned;
meta(df::DataFrame, column::Symbol, key::Union{Symbol,String}; default=nothing): returns the metadata entry with key key from the column column dictionary. If the key is not present the value of the default keyword will be returned;
metaset!(df::DataFrame, key::Union{Symbol,String}, value): set an entry in the table metadata dictionary with key key and value value;
metaset!(df::DataFrame, column::Symbol, key::Union{Symbol,String}, value): set an entry in the column column metadata dictionary with key key and value value;
metakeys(df::DataFrame): return the keys in the table dictionary;
metakeys(df::DataFrame, column::Symbol): return the keys in the column column dictionary.

Example:

using DataFrames
df = DataFrame(:col1=>1, :col2=>[1,2])
showcols(df)

# Request a non-present key
println("Table source: ",  meta(df, :source, default="Unknown"))

# Set an entry in the dictionary and read it back
metaset!(df, :source, "www.some.site")
println("Table source: ",  meta(df, :source))

# Set an entry using a string as key
metaset!(df, "query", "The query used to retrieve the data...")

# Request non-present keys in the column dictionaries
println("Column descr.: ",  meta(df, :col1, :descr, default="Unspecified"))
println("Column unit  : ",  meta(df, :col1, :unit,  default="Unspecified"))

# Set entries in the column dictionaries and read them back
metaset!(df, :col1, :descr, "First column")
metaset!(df, :col1, :unit , "km / s")
println("Column descr.: ",  meta(df, :col1, :descr))
println("Column unit  : ",  meta(df, :col1, :unit ))

# `showcols` now searches for the `:descr` and `:unit` entries in column
# dictionaries.  If these are available, and the values can be
# converted to a `String`, they are also printed
showcols(df)

# Explore the column metadata dictionary
for key in metakeys(df, :col1)
    println("$key = ", meta(df, :col1, key))
end

This PR adds no package dependencies and is backward compatible. All new methods have their own docstring, and a new module has been added to test the new facility (in test/meta.jl)

nalimilan · 2018-05-27T12:53:38Z

Done, new method names are: meta, setmeta! and metakeys (to retrieve metadata keys);

Ah, indeed now that you mention metakeys I realize we need a way to get the names of available keys.

I wonder whether it wouldn't be better to provide a single meta function which would return an object of the DataFrameMetadata <: AbstractDict type. Then you'd do m = meta(df), keys(m), m[:key, :col], and m[:key, :col] = .... And just m[:key] to set global meta-data. Not sure which approach is the best one.
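A rough sketch of the single-object access pattern described above, with `m[:key]` for global entries and `m[:key, :col]` for column entries. `TableMeta` is a hypothetical name, and for brevity it is a plain struct with indexing methods rather than a full `AbstractDict` subtype.

```julia
# Hypothetical type; a plain struct with indexing methods rather than a
# complete AbstractDict subtype, to keep the sketch short.
struct TableMeta
    global_meta::Dict{Symbol, Any}                 # table-level entries
    column_meta::Dict{Tuple{Symbol, Symbol}, Any}  # (key, column) => value
end

TableMeta() = TableMeta(Dict{Symbol, Any}(), Dict{Tuple{Symbol, Symbol}, Any}())

# m[:key] reads and writes table-level metadata.
Base.getindex(m::TableMeta, key::Symbol) = m.global_meta[key]
Base.setindex!(m::TableMeta, value, key::Symbol) = (m.global_meta[key] = value; m)

# m[:key, :col] reads and writes column-level metadata.
Base.getindex(m::TableMeta, key::Symbol, col::Symbol) = m.column_meta[(key, col)]
Base.setindex!(m::TableMeta, value, key::Symbol, col::Symbol) =
    (m.column_meta[(key, col)] = value; m)

# keys(m) lists the table-level keys; keys(m, :col) those set for one column.
Base.keys(m::TableMeta) = collect(keys(m.global_meta))
Base.keys(m::TableMeta, col::Symbol) =
    [key for (key, c) in keys(m.column_meta) if c == col]

m = TableMeta()
m[:source] = "www.some.site"
m[:unit, :col1] = "km / s"
```

The appeal of this shape is that no dedicated getter or setter names are needed: `getindex`, `setindex!` and `keys` cover the whole API.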

I agree we have no strong use cases to use Any for keys, still I think that both Symbol and String may be useful.

For instance, we may suggest users to adopt Symbol keys for quantities supposed to be read/interpreted by other programs, and String keys for quantities to be displayed (e.g. plot labels).

It would be confusing to allow for both strings and symbols. Anyway they are displayed the same, so I'm not sure why you say strings are better in that case?

I'm not sure I understood. Are you proposing:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Vector{Any}} # column specific

?

This approach is not flexible enough since you would not have column specific keys (i.e. :col1, :unit). Rather I would prefer:

meta::Dict{Symbol, Any} # global
colmeta::Dict{Symbol, Dict{Symbol,Any}} # column specific

but also this approach is not convenient since colmeta should be updated each time the colindex is updated. Hence, I think it is better to modify the Index type (as I did).

Or maybe I'm missing something?

I mean the former (but with colmeta::Dict{Symbol, Vector}). It allows for column-specific keys, it just requires storing a nothing entry for columns where the property isn't set.

Both approaches are equivalent from the user's POV, it's just a matter of efficiency in typical use cases. Updating the meta-data when the index is modified isn't an issue, it just requires a few additional function calls when adding or removing columns.

Sorry, I don't understand why. Data and metadata live in two separate objects, and setindex! and getindex only operate on data.

Moreover, when a column is added/deleted the corresponding Dict is added/deleted accordingly.

Could you please elaborate on this ?

I mean that things like df[1:3], df[:, 1:3], df[1:10, :] and df[1:0, 1:3] should return a DataFrame with the meta-data from columns 1 to 3. There are a few getindex variants which need to handle this. Also, it's probably worth thinking about whether we should keep column meta-data or drop it when replacing columns, e.g. via df[1] = v.
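To make the requirement concrete, here is a small sketch of column selection that carries column metadata along. `ToyTable` and `slice` are hypothetical stand-ins written in plain Julia, not the DataFrames `getindex` methods.

```julia
# Toy stand-in for a table; not the DataFrames implementation.
struct ToyTable
    names::Vector{Symbol}
    columns::Vector{Any}                 # one vector per column
    colmeta::Vector{Dict{Symbol, Any}}   # one dict per column, same order
end

# Select rows and columns; metadata follows the selected columns and is
# copied so that the result can be modified independently.
function slice(t::ToyTable, rows, cols::AbstractVector{<:Integer})
    ToyTable(t.names[cols],
             Any[t.columns[c][rows] for c in cols],
             Dict{Symbol, Any}[copy(t.colmeta[c]) for c in cols])
end

t = ToyTable([:a, :b, :c],
             Any[[1, 2, 3], [4.0, 5.0, 6.0], ["x", "y", "z"]],
             [Dict{Symbol, Any}(:unit => "m"),
              Dict{Symbol, Any}(),
              Dict{Symbol, Any}(:descr => "labels")])

s = slice(t, 1:2, [1, 3])   # keeps the metadata of columns 1 and 3
e = slice(t, 1:0, [1, 2])   # zero rows, metadata still present
```

Whether assignment such as `df[1] = v` should keep or drop the old column's entry is a separate policy choice that this sketch does not address.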

kescobo · 2018-05-27T13:38:13Z

Piggybacking on the point about indexing - how are the view() family of functions implemented? Will it just work to use these methods on subdataframes?

nalimilan · 2018-05-27T13:45:57Z

Piggybacking on the point about indexing - how are the view() family of functions implemented? Will it just work to use these methods on subdataframes?

Good point. view just creates a SubDataFrame, so we will need to delegate methods to the parent DataFrame.
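A minimal sketch of that delegation, assuming the view only keeps a reference to its parent. `MetaTable`, `MetaView`, `getmeta` and `setmeta!` are hypothetical names used only for this illustration.

```julia
# Hypothetical types; they only model where the metadata lives.
struct MetaTable
    meta::Dict{Symbol, Any}
end

struct MetaView
    parent::MetaTable      # a view holds a reference, not a copy
end

getmeta(t::MetaTable, key::Symbol) = t.meta[key]
setmeta!(t::MetaTable, key::Symbol, value) = (t.meta[key] = value; t)

# The view forwards both operations to its parent.
getmeta(v::MetaView, key::Symbol) = getmeta(v.parent, key)
setmeta!(v::MetaView, key::Symbol, value) = (setmeta!(v.parent, key, value); v)

t = MetaTable(Dict{Symbol, Any}(:source => "www.some.site"))
v = MetaView(t)
setmeta!(t, :source, "UPDATED")
getmeta(v, :source)        # "UPDATED", because the view reads from the parent
```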

pdeffebach · 2018-05-27T14:02:56Z

It would be useful to have a look at whether/how other software does this. AFAICT neither dplyr, data.table nor Pandas (pandas-dev/pandas#2485) support meta-data,

R supports metadata via attributes. It's just an array of Strings. Granted, in R you can add an attribute to any object, but it is really only useful for dataframes. I do use attributes with R, however. It's easy to write a simple plotting function that calls attributes[df$x][1].

dmbates · 2018-05-27T16:47:27Z

@pdeffebach I would take issue with your statement that R attributes are only useful for data frames. Any complexity of R (and before R, S) objects is determined by attributes. Dimensions of matrices and higher-order arrays, for example, are stored as attributes. The only primitive R data objects, in the sense of SEXPREC's, are fixed-length vectors of 32-bit integers, fixed-length vectors of 64-bit floats, fixed-length vectors of interned character strings, and fixed-length vectors of pointers to SEXPRECs. Everything else is coded in the attributes.

I don't think that R attributes are a good model for this facility.

dmbates · 2018-05-27T16:52:37Z

I think that @nalimilan's suggestion of having an extractor that returns a Dict is the best way to go. Could I make a plea for it to be named metadata instead of meta? meta could be about metaprogramming, etc. Especially if the use is to be a pattern like m = metadata(df); m[:key] etc. I think the clarity of the name outweighs the cost of typing 4 more characters.

nalimilan · 2018-05-27T18:20:02Z

@gcalderone Maybe wait until the resolution of the discussion in the Discourse thread to avoid wasting your time if we choose the alternative approach (storing meta-data in vectors).

Reference to previous discussion: #35.


gcalderone · 2018-05-28T00:16:58Z

Implemented metadata copying while copying/slicing/creating a view of the DataFrame. I also added a showmeta method to pretty print metadata contents.

Examples:

# Create a main DataFrame
df = DataFrame(:col1=>1, :col2=>1:10, :col3=>"dummy")
metaset!(df, :key1, "val1")
metaset!(df, :col1, :key1, "val1")
showmeta(df) # pretty print metadata

# Copy
c = copy(df)  # both data and metadata are copied
c[:col1] *= 2
metaset!(c, :key1, "UPDATED")
metaset!(c, :col1, :key1, "UPDATED")
showmeta(c)  # updated
showmeta(df) # unchanged

# Slice
sub = df[2:5, [:col1]]
metaset!(sub, :key1, "UPDATED")
metaset!(sub, :col1, :key1, "UPDATED")
showmeta(sub)  # updated
showmeta(df) # unchanged

# View
vv = view(df, 2:5, [:col1])
metaset!(df, :key1, "UPDATED")
metaset!(df, :col1, :key1, "UPDATED")
showmeta(vv)  # updated
showmeta(df)  # updated

# Insert a DataFrame
add = DataFrame(:col4=>rand(size(df)[1]))
metaset!(add, :key1, "ADDITIONAL")
metaset!(add, :col4, :key1, "ADDITIONAL")
df[[:col1]] = add
showmeta(df)

# Merge two DataFrame objects
merge!(df, add)
showmeta(df)

# Empty metadata dictionaries
emptymeta!(df)
showmeta(df)

nalimilan · 2018-06-02T15:06:02Z

Sorry for the delay. After thinking a bit more about this, I think it would be cleaner to define a DataFrameMetadata type which would be handled a little like Index in most functions: for example getindex(::DataFrame, ::Vector{Symbol}) would call getindex on it and pass the resulting DataFrameMeta object to the DataFrame constructor. Global meta-data would always be preserved when indexing that object.

Then, as noted above, you don't need to add all these new functions: people can get and set meta-data using getindex, setindex! and keys, and it can be printed via the standard show function.

The definition of the type and of its methods should go to a separate src/dataframemeta/dataframemeta.jl file. Please also drop MetaKey in favor of Symbol. Better keep things simple.

pdeffebach · 2018-06-08T03:03:55Z

Okay as far as I can tell this means

Adding a new DataFrame constructor method in the type definition to allow a constructor with a metadata type, but (thanks to multiple dispatch) leave the other constructor alone
Update copy and deepcopy to use this new constructor
Update getindex

Then we can do this with just function metadata(df) = getfield(df, :metadata) and

Base.copy(df::DataFrame) = DataFrame(copy(columns(df)), copy(index(df)), copy(metadata(df)))

Without the introduction of copymeta! etc.

pdeffebach · 2018-06-08T03:31:23Z

I also see what you mean with regards to MetaData behaving like an Index. We need to define names! etc. to act on a MetaData type and have that be called in abstractdataframe.jl. However, having a call to names!(::MetaData) means that any AbstractDataFrame, should someone define their own type <: AbstractDataFrame, would have to overwrite that method.

The way that this PR would get around this is by having colindex know a decent amount about the metadata of a dataframe. It seems like a more streamlined approach would be to keep them entirely separate, but edit metadata when you edit colindex, where relevant (I'm sure there are other examples outside of renaming).

@gcalderone @nalimilan should we finalize the changes to dataframes.jl before moving forward with how MetaData might actually behave?

nalimilan · 2018-06-08T09:58:15Z

The way that this PR would get around this is by having colindex know a decent amount about the metadata of a dataframe. It seems like a more streamlined approach would be to keep them entirely separate, but edit metadata when you edit colindex, where relevant (I'm sure there are other examples outside of renaming).

Yes, better keep them separate. I'm not sure why renaming should affect meta-data: column-specific meta-data should be stored by position rather than by name, so that it only needs to be adjusted when reordering columns (which is less frequent).
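A small sketch of why positional storage makes renaming free, under the assumption that names and metadata are kept in parallel vectors. `PosTable`, `colposition`, `unit`, `rename!` and `reorder!` are hypothetical names.

```julia
# Hypothetical model: column names plus metadata stored by position.
mutable struct PosTable
    names::Vector{Symbol}
    units::Vector{Union{Nothing, String}}   # one slot per column position
end

colposition(t::PosTable, name::Symbol) = findfirst(==(name), t.names)
unit(t::PosTable, name::Symbol) = t.units[colposition(t, name)]

# Renaming only touches the names; metadata stays attached by position.
function rename!(t::PosTable, oldname::Symbol, newname::Symbol)
    t.names[colposition(t, oldname)] = newname
    return t
end

# Reordering must permute names and metadata together.
function reorder!(t::PosTable, perm::AbstractVector{<:Integer})
    t.names = t.names[perm]
    t.units = t.units[perm]
    return t
end

t = PosTable([:speed, :id], ["km / s", nothing])
rename!(t, :speed, :velocity)
unit(t, :velocity)          # "km / s"
reorder!(t, [2, 1])
unit(t, :velocity)          # "km / s"
```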

@gcalderone @nalimilan should we finalize the changes to dataframes.jl before moving forward with how MetaData might actually behave?

Yeah, I guess it makes sense to start with a minimal implementation. However it should probably support at least a few operations so that it's testable and at least minimally usable.

pdeffebach · 2018-06-09T02:40:54Z

column-specific meta-data should be stored by position rather than by name, so that it only needs to be adjusted when reordering columns (which is less frequent).

I was trying to implement a lot of the functions for Index on a MetaData type which was just two Dicts and ran into the problem with names!. There are a lot of functions in DataFrames that are agnostic to the current names of the dataframe. With names! there is no for loop with individual renaming.

Then I realized that's the whole point of the Index, to keep track of all this. Consequently, I'm going to try an implementation where MetaData is just an array of Dict{Symbol, String}. Then I can use the exact same architecture that relates colindex with columns to relate colindex with metadata.

I will worry about global metadata later since that seems like it will be easier to add after the indexing and renaming functions are taken care of.

edit: It should probably be a Dict{Int, Dict{Symbol, String}} eventually so we don't automatically create a bunch of Dicts when a DataFrame is created.

nalimilan · 2018-06-09T15:32:47Z

I was trying to implement a lot of the functions for Index on a MetaData type which was just two Dicts and ran into the problem with names!. There are a lot of functions in DataFrames that are agnostic to the current names of the dataframe. With names! there is no for loop with individual renaming.

Then I realized that's the whole point of the Index, to keep track of all this. Consequently, I'm going to try an implementation where MetaData is just an array of Dict{Symbol, String}. Then I can use the exact same architecture that relates colindex with columns to relate colindex with metadata.

Sorry, I don't understand. Wouldn't storing column-specific meta-data in a vector allow precisely to be agnostic to column names and only use their integer indices instead? Then you can use the index to get the integer index from the name, which the code already does anyway.

pdeffebach · 2018-06-09T17:22:27Z

Then you can use the index to get the integer index from the name, which the code already does anyway.

Yes, that is exactly what I am trying to implement. A first step will be to have MetaData be an array of dicts, one for every column, and work from there. That way we can use Base getindex functions etc. exactly like the columns(df).

In the future, we might not want to initiate an array of empty Dicts every time, because that might get annoying for DataFrames with a large number of columns, and instead have a cleverer approach that only creates Dicts to store strings for the variables the user wants. Every column in a DataFrame has a column of data, but we don't necessarily want every column in a DataFrame to have metadata.

However this just means clever versions of getindex, permute etc. for the metadata type. We should write code in dataframe.jl that works assuming MetaData contains a vector of Dicts, then change the way MetaData actually works later.

nalimilan · 2018-06-09T18:22:21Z

OK. But then why use dicts when vectors would be simpler and more efficient?

pdeffebach · 2018-06-09T18:26:23Z

The purpose of using Dicts as a whole was so that people could add particular information like :unit, maybe :source etc. So metadata for each column is a dictionary.

I am imagining a "default" :label key in these Dicts so that other packages can get printable labels easily.

I also changed my mind about a vector of Dicts anyway, because then whenever we add a new column, we have to push an empty Dict onto the array of Dicts. Basically we would have to touch all the setindex! code. It's easier to have the user make a new dictionary when they want to.

nalimilan · 2018-06-09T18:47:16Z

We clearly need a dictionary to map the user-defined meta-data types to their values. But better store values for each type of meta-data in a vector with one entry per column. Yes, you need to resize the vector each time you add a column, but calling getindex will be much less expensive than with one dict per column (copying lots of dicts is going to be slow, and the cost increases with the number of columns) and it's a much more frequent operation.

pdeffebach · 2018-06-09T19:06:01Z

Are you saying a vector

units = array of strings
sources = array of strings

If that's the case, I'm not sure that's a great idea because unit might not be applicable to many columns. Rather, the user will add unit metadata to columns on an as-needed basis. If a user wants to add, say, a :transformation metadata entry to just one column, that would involve creating a whole new array of strings, where there is only one non-empty value.

Here is what I have written. I just finished the getindex implementation and need to add a few more metadata entry functions and tests before I can make a PR.

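# Defining behavior for DataFrames metadata
abstract type AbstractMetaData end

# Sparse per-column metadata: column position => Dict of entries.
# Columns without metadata have no key in `columndata`.
mutable struct MetaData <: AbstractMetaData
    columndata::Dict{Int, Dict{Symbol, String}}
end

MetaData() = MetaData(Dict{Int,Dict{Symbol, String}}())

# Build from an array of Dicts; entry i is attached to column i.
function MetaData(x::Array{Dict{Symbol, String}})
    columndata = Dict{Int, Dict{Symbol, String}}()
    for i in eachindex(x)
        columndata[i] = x[i]
    end
    MetaData(columndata)
end

# Select the metadata of the columns in `col_inds`. Columns without
# metadata are skipped, so the result is renumbered 1, 2, ... in order.
function Base.getindex(x::MetaData, col_inds::AbstractVector)
    dictarray = [x.columndata[i] for i in col_inds if haskey(x.columndata, i)]
    MetaData(dictarray)
end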

So if we have a dataframe

│ Row │ x1       │ x2        │ x3       │ x4       │ x5       │ y       │
├─────┼──────────┼───────────┼──────────┼──────────┼──────────┼─────────┤
│ 1   │ 0.384981 │ 0.760864  │ 0.432747 │ 0.277874 │ 0.768403 │ 2.5135  │
│ 2   │ 0.247484 │ 0.543756  │ 0.30999  │ 0.623039 │ 0.181284 │ 69.9532 │
│ 3   │ 0.937264 │ 0.392081  │ 0.803099 │ 0.855908 │ 0.164826 │ 76.241  │
│ 4   │ 0.510181 │ 0.818951  │ 0.464661 │ 0.680837 │ 0.575248 │ 24.2244 │
│ 5   │ 0.936744 │ 0.0300326 │ 0.568476 │ 0.188381 │ 0.135375 │ 59.0194 │

Our metadata.columndata looks like this:

Dict(
    1 => Dict(:label => "label for x1"),
    2 => Dict(:label => "label for x2"),
    ...
)
and we want df[[:x3, :x4, :x5]], then we go through the above columndata Dict and make an array of the Dicts for each column, but only if their column index is 3, 4, or 5.

Then there is a constructor that makes a new MetaData instance based on that array of Dicts. This allows for the re-indexing to happen just as though we had a vector of Dicts, but without the overhead of having a vector of Dicts. I was under the impression this would be more performant, since only the objects that are needed are created.

If I understand correctly, you are saying that it is this copying of Dicts into an array that is expensive and undesirable. But can it really be less performant than a new array for each metadata field (units, etc.)? Perhaps the answer hinges on our expectations for the ratio of unlabeled to labeled variables and the ratio of common to unique metadata fields.
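One detail of the sparse layout worth spelling out: in the getindex shown above, columns without metadata are skipped before renumbering, so the surviving entries can end up attached to the wrong position in the result. A possible variant that tracks the new position explicitly is sketched below; `SparseMeta` is a hypothetical name, and copying each inner Dict is a design assumption rather than something stated in the thread.

```julia
# Hypothetical variant of the sparse layout; names are illustrative only.
struct SparseMeta
    columndata::Dict{Int, Dict{Symbol, String}}   # column position => entries
end

# Keep only the selected columns and key each entry by its position in the
# selection, so a column without metadata does not shift its neighbours.
function Base.getindex(m::SparseMeta, col_inds::AbstractVector{<:Integer})
    selected = Dict{Int, Dict{Symbol, String}}()
    for (newpos, oldpos) in enumerate(col_inds)
        if haskey(m.columndata, oldpos)
            selected[newpos] = copy(m.columndata[oldpos])
        end
    end
    return SparseMeta(selected)
end

# Only columns 2 and 5 carry metadata.
m = SparseMeta(Dict(2 => Dict(:label => "label for x2"),
                    5 => Dict(:label => "label for x5")))

sub = m[[3, 4, 5]]   # column 5 becomes position 3 of the selection
```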

nalimilan · 2018-06-09T20:06:01Z

If that's the case, I'm not sure that's a great idea because unit might not be applicable to many columns. Rather, the user will add unit metadata to columns on an as-needed basis. If a user wants to add, say, a :transformation metadata entry to just one column, that would involve creating a whole new array of strings, where there is only one non-empty value.

That's not a big deal. Typically that will use 64 bits per column, which is nothing compared to the size of the columns themselves. And dicts consist of three arrays, using 336 bits by default even when empty for Dict{Symbol,String}.

If I understand correctly, you are saying that it is this copying of Dicts into an array that is expensive and undesirable. But can it really be less performant than a new array for each metadata field (units, etc.)? Perhaps the answer hinges on our expectations for the ratio of unlabeled to labeled variables and the ratio of common to unique metadata fields.

Exactly. I assume it's unlikely you will use lots of meta-data fields that will differ from one column to another. As I noted, copying a dict involves copying three vectors, plus five Int. Copying one vector for each meta-data field should be cheap compared to that, unless you have many more fields than columns.

pdeffebach · 2018-06-10T03:31:31Z

Thank you for the guidance!

So my impression is that your vision of MetaData would be like

mutable struct MetaData
    columndata::Dict{symbol => array of String} # maybe Union{Void, String}
end
...
addmeta(df, :var, :unit, "km/hr")
...
colindex = column index of :var in df
if adding `:unit` for first time
    columndata[:unit] = ["" for i in 1:ncol(df)] # maybe `nothing` or something
    # or even a sparse vector if we were really worried about memory
end
columndata[:unit][colindex] = "km/hr"

Then getindex etc. will essentially just be broadcasted or mapped for the columndata dict.

Unless you wanted a more DataFrames-like scenario where there is a Dict that just maps symbols to indices, then another array of arrays (or matrix) that actually contains the information. I'm not 100% sure if it's just having many Dicts that is undesirable or if having one larger Dict is also a problem, and it's better to store the info in a better format as well.


nalimilan · 2018-06-10T10:55:54Z

Yes, more or less something like that. It's fine to have a single dict to map meta-data fields to the vectors that hold the values.
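The layout agreed on here (a single dict mapping each meta-data field to a vector with one slot per column) could be made runnable along the following lines. This is a sketch under assumptions: `FieldMeta`, `Slots`, `addmeta!` and `addcolumn!` are hypothetical names, and `nothing` is used for unset slots.

```julia
# Hypothetical names throughout; a sketch of the layout discussed above.
const Slots = Vector{Union{Nothing, String}}

mutable struct FieldMeta
    ncols::Int
    columndata::Dict{Symbol, Slots}   # field => one slot per column
end

FieldMeta(ncols::Integer) = FieldMeta(ncols, Dict{Symbol, Slots}())

# Set one field for one column, creating the field's vector on first use.
function addmeta!(m::FieldMeta, col::Integer, field::Symbol, value::String)
    slots = get!(m.columndata, field) do
        Slots(nothing, m.ncols)       # all columns start unset
    end
    slots[col] = value
    return m
end

# Adding a column grows every field vector by one empty slot.
function addcolumn!(m::FieldMeta)
    m.ncols += 1
    foreach(slots -> push!(slots, nothing), values(m.columndata))
    return m
end

# Column selection is one vector indexing operation per field.
function Base.getindex(m::FieldMeta, cols::AbstractVector{<:Integer})
    FieldMeta(length(cols),
              Dict{Symbol, Slots}(f => slots[cols] for (f, slots) in m.columndata))
end

m = FieldMeta(3)
addmeta!(m, 2, :unit, "km/hr")
addcolumn!(m)
sub = m[[2, 4]]
```

The cost of selection here scales with the number of fields rather than the number of columns that carry metadata, which is the trade-off weighed in the exchange above.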

khughitt · 2018-10-06T19:15:06Z

This may be outside the desired scope of this effort, but have you considered extending support to include both column and row metadata?

An example of where this would be useful would be something like genomic data where each row might be a gene, and each column a sample or patient. Having associated metadata for both would be useful.

nalimilan · 2018-10-06T20:12:09Z

Per-row metadata is just... a column? Am I missing something? Do you know of other software which supports this?

khughitt · 2018-10-06T20:22:45Z

@nalimilan I guess I am thinking of cases where you have something more like homogeneous numeric data in most columns and you wouldn't necessarily want to mix in "metadata" columns that represent something else. In this case though I suppose an annotated multidimensional array is really what I am looking for..

As far as something already implementing this, I've started work on something like this in R, but it is still very immature and far-from-perfect, which is why I'm exploring what's been done in other communities ;)

nalimilan · 2018-10-06T20:34:39Z

I guess you could use per-column meta-data to indicate which columns are "real data" and which are "metadata". :-)

We haven't implemented anything to select columns based on criteria (regexes, name ranges, types...) yet, but we should certainly investigate this area (like dplyr and JuliaDB).

bkamins · 2019-07-24T14:38:09Z

@gcalderone + @pdeffebach

I am not entirely clear on the relation of this PR and #1458.
If I understand the things correctly they are overlapping (but I might have missed something - then please correct me).

So my question is: which of them should be left open, or should both be closed so that we open a new PR that is rebased to v0.19 and implements the target functionality (from my experience it is sometimes simpler than trying to update old PRs)?

pdeffebach · 2019-07-24T17:58:07Z

This PR should be closed, as #1458 supersedes it.

Yes I think that we should open a new PR attempting this again with the progress of #1458 added on top.

bkamins · 2019-07-24T18:01:17Z

OK - so I am closing this and leave #1458 open as a "placeholder" until a new PR is opened (when #1458 should be closed).

gcalderone added 2 commits May 26, 2018 19:43

First commit

d70ac2d

Added docstring to methods

6db55d9

Methods renamed; meta keys are now Union{Symbol,String}

ba7a5a2

Implemented several new methods to copy/merge metadata while copying/slicing a DataFrame

379763a

pdeffebach mentioned this pull request Jun 15, 2018

A second attempt at DataFrames Metadata #1429

Closed

nalimilan mentioned this pull request Aug 31, 2018

Continue adding Metadata to dataframes #1458

Closed

jsm296 mentioned this pull request Feb 8, 2019

potential for adding metadata to pandas data frames jupyterlab/jupyterlab-metadata-service#10

Closed

bkamins closed this Jul 24, 2019
