Data Layers #17
I thought we'd discussed not using |
Yes, but then we decided to use finalizers instead, where possible, and not expose the context to users at all. That's what was implemented in #12 for You'll note that #12 contains no mention of ResourceContexts.jl in the documentation update. Also, the above is purely optional use of
ctx = ResourceContext()
data = open(ctx, dataset("table_tsv"))
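For readers unfamiliar with the explicit-context style, here is a minimal sketch of the idea: a context object collects cleanup actions and runs them when it is closed. `ToyContext`, `defer!` and `open_resource` are hypothetical names invented for this illustration; this is not how ResourceContexts.jl or DataSets.jl implement it.

```julia
# Toy stand-in for a resource context. All names here are hypothetical.
struct ToyContext
    cleanups::Vector{Any}   # zero-argument functions to run on close
end
ToyContext() = ToyContext(Any[])

# Register a cleanup action with the context.
defer!(f, ctx::ToyContext) = (push!(ctx.cleanups, f); nothing)

# Closing the context runs the cleanups in reverse order of registration.
function Base.close(ctx::ToyContext)
    while !isempty(ctx.cleanups)
        pop!(ctx.cleanups)()
    end
    return nothing
end

# Open a file whose lifetime is tied to the context rather than to a finalizer.
function open_resource(ctx::ToyContext, path::AbstractString)
    io = open(path, "r")
    defer!(() -> close(io), ctx)
    return io
end

# Usage: write a small temporary file, read it through the context, clean up.
path, tmp = mktemp()
write(tmp, "a\tb\n")
close(tmp)

ctx = ToyContext()
io = open_resource(ctx, path)
first_line = readline(io)
close(ctx)   # closes `io` deterministically
rm(path)
```

The point of the sketch is that cleanup happens at a known place (`close(ctx)`) and does not depend on the garbage collector or on the type of the returned object.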
Of course, the issue with the finalizer approach is that it doesn't work with some third-party types such as |
Data layers allow data of different formats to be mapped into a program through a decoder and presented with a uniform API such that the main program logic can avoid dealing with data format decoding. Instead, the data format can be defined in the Data.toml.

A challenge here is dealing with world age issues which come up from dynamically `require`ing Julia packages. For now, we include a bit of judicious Base.invokelatest to make things "just work" in the REPL, but also warn the user that they should add a top-level import.

Here's some examples of data which comes encoded in many forms, but we'd like to treat through the same API:

Tabular data:
* csv
* gzip.csv
* tsv
* arrow
* parquet
* ...

Byte streams:
* raw
* gzip
* xz
* zstd
* ...

Images
* png
* jpeg
* tiff
* ...

Data trees
* directories
* zip
* hdf5
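To illustrate the world age issue mentioned in the description, here is a small self-contained sketch. It does not load any package; instead it adds a method at runtime with `@eval`, which has the same effect on world age. The names `toy_decode`, `install_text_decoder!` and `load_text` are hypothetical and are not part of this PR.

```julia
# Hypothetical names throughout; this is not DataSets.jl code.
function toy_decode end   # generic function with no methods yet

# Stand-in for loading a decoder package at runtime: it adds a method
# to `toy_decode` while the program is already running.
function install_text_decoder!()
    @eval toy_decode(::Val{:text}, bytes::Vector{UInt8}) = String(copy(bytes))
    return nothing
end

function load_text(bytes::Vector{UInt8})
    install_text_decoder!()
    # The first time this runs, a direct call `toy_decode(Val(:text), bytes)`
    # here would throw a MethodError, because the method is newer than the
    # world age `load_text` is running in. `invokelatest` looks the method
    # up in the latest world instead.
    return Base.invokelatest(toy_decode, Val(:text), bytes)
end

text = load_text(Vector{UInt8}("a\tb"))
```

`Base.invokelatest` gives up static dispatch for that call, which is presumably why the description also suggests warning the user to add a top-level import.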
Return a mutable wrapper object, perhaps? Either that or if the object is immutable, throw an error and require the caller to use the explicit context form (or the |
Thanks, I think these are the options. I've been mulling it over but haven't come up with anything else yet. With wrappers, there seem to be two alternatives
All together, wrappers don't seem very appealing. I'm inclined to just error and direct the user to the explicit context-based API for the generic code path. As a hybrid, we could implement a few wrappers for APIs which are relatively well defined and commonly used, e.g., tables.
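A rough sketch of the "just error" option: attach a finalizer when the object is mutable, and otherwise throw an error that points the caller at the explicit context API. `attach_cleanup!` and the two handle types are hypothetical names for illustration only, and the sketch assumes Julia 1.5 or later for `ismutable`.

```julia
# Hypothetical helper, not part of DataSets.jl.
function attach_cleanup!(cleanup, obj)
    if !ismutable(obj)
        # Finalizers cannot be attached to immutable objects.
        throw(ArgumentError(
            "cannot attach a finalizer to an immutable $(typeof(obj)); " *
            "use the explicit context-based API instead"))
    end
    finalizer(_ -> cleanup(), obj)
    return obj
end

mutable struct MutableHandle
    name::String
end

struct ImmutableHandle
    name::String
end

# Mutable case: the cleanup runs when the object is finalized.
closed = Ref(false)
h = attach_cleanup!(() -> (closed[] = true), MutableHandle("table_tsv"))
finalize(h)  # run the finalizer now rather than waiting for the garbage collector

# Immutable case: the helper refuses and reports an ArgumentError.
rejected = try
    attach_cleanup!(() -> nothing, ImmutableHandle("table_tsv"))
    false
catch err
    err isa ArgumentError
end
```

The explicit `finalize(h)` is only there to make the example deterministic; in normal use the cleanup would run whenever the garbage collector gets to the object.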
Honestly it seems most appealing to me to just always require the context object. Once people learn to do this it will always work.
Hmm. So I could have a file that is encrypted + compressed, and layers would allow the program to peel this back and handle that on the fly? What other types of preprocessing could be layers? User-defined layers?
Yes, this should be possible. I think the interesting/tricky thing here is having a way to provide parameters to layers. In particular, how would we inject the decryption keys in a secure way? I suppose these are logically a property of the
Anything that represents a linear pipeline of decoding stages could be represented. (Conversely, more general DAGs cannot be represented as cleanly: the whole DAG would have to be represented as a single non-composable layer.)
Yes, in this PR the user should be able to define their own layer by calling |
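As a sketch of the linear-pipeline idea with per-layer parameters, the following uses a hypothetical registry (`TOY_LAYERS`, `register_layer!`, `apply_layers`) and two toy layers: base64 decoding from the standard library, and a one-byte XOR standing in for decryption. None of these names come from this PR, the XOR layer is not real cryptography, and passing the key in a plain dictionary deliberately sidesteps the open question of how to inject keys securely.

```julia
using Base64

# Hypothetical registry of named layers. Each layer maps bytes to bytes
# and receives its own parameter dictionary.
const TOY_LAYERS = Dict{String,Function}()

register_layer!(name::AbstractString, f) = (TOY_LAYERS[name] = f; nothing)

register_layer!("base64", (bytes, params) -> base64decode(String(copy(bytes))))
# Toy "decryption" by XOR with a one-byte key; not real cryptography.
register_layer!("xor", (bytes, params) -> xor.(bytes, params["key"]))

# Apply the stages in order; each stage is a `name => params` pair.
function apply_layers(bytes::Vector{UInt8}, stages)
    for (name, params) in stages
        bytes = TOY_LAYERS[name](bytes, params)
    end
    return bytes
end

# Build some "stored" data by encoding in the opposite order, then decode it.
plain = Vector{UInt8}("hello")
stored = Vector{UInt8}(base64encode(xor.(plain, 0x5a)))
stages = ["base64" => Dict{String,Any}(),
          "xor" => Dict{String,Any}("key" => 0x5a)]
decoded = String(apply_layers(stored, stages))
```

Because each stage only sees the output of the previous one, a pipeline like this composes freely, which is also why a general DAG would have to be wrapped up as one opaque layer.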
I'll go ahead and close this PR, since I don't think we'll merge it. But the branch and discussion will stay around for future reference.
With this patch and the Data.toml from the tests, we can open several tabular data formats, without the user needing to know much about the data storage.
Here's an example of loading data in .tsv, .gzip.csv and .arrow formats (any of which could then be converted to a DataFrame thanks to the Tables.jl interface)
Excerpt from Data.toml, showing the configuration required for the system to understand these various formats: