Open
Opened on May 6, 2022
Here's a rough list of items I'm considering on the path to a DataSets-1.0 release. Several of these can and should be done prior to version 1.0 in case the APIs need to be adjusted a bit before the 1.0 release.
- Streamline access for small datasets by providing a "high level" API for use when working with a fully in-memory representation of the data which doesn't require the management of separate resources. ("Separate resources" would be things like managing an on-disk cache of the data, incremental async download/upload; that kind of thing.) Perhaps we can use the verbs `load()`/`save()` for this. Thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
- Somehow allow `load()` and `save()` to return some "default type the user cares about" for convenience. For example, returning a `DataFrame` for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in Data Layers #17.
- Consider the fate of `dataset()` and `open()`. Currently the `open(dataset(...))` idiom is a bit of an awkward double step and leads to some ambiguities. Perhaps we could repurpose `dataset(name)` to mean what `open(dataset(name))` currently does?
- Perhaps unexport `DataSet`? Users should rarely need to use this directly.
- Storage API: finalize how we're going to deal with "resources" which back a lazily downloaded dataset (cache management, etc.). We could adopt the approach from ResourceContexts.jl, for example using `ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx)`. Or from ContextManagers.jl in the style `ctx = dataset("name"); x = value(ctx); close(ctx)`. (Both of these have macros for syntactic shortcuts.)
- Improve and formalize the `BlobTree` API.
- Figure out how we can integrate with `FilePathsBase` and whether there's a type which can implement the `AbstractPath` interface well enough to allow things like `CSV.read(x)` to work for some `x`. Perhaps we need a `DataSpecification` type for the URI-like concept currently called "dataspec" in the codebase? We could have `CSV.read(data"foo?version=2#a/b")`?
- Consider deprecating and removing the "data entry point" stuff `@datarun` and `@datafunc`. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
- Fix some issues with Data.toml
  - Consider representing the `[datasets]` section as a dictionary mapping names to configs, not as an array with `name` properties. This is safe because `TOML` syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given `DataSet` is specifically tied to a project.)
  - Move data storage driver type outside of the storage section?
  - Fix up the mess with `@__DIR__` templating somehow (fixed in DataSet configuration #46)
- Dataset resolution
  - Rename `DataSets.PROJECT` to `DataSets.PROJECTS` if this is always a `StackedDataProject`.
  - Consider whether we really want a data stack vs how "data authorities" could perhaps work (i.e., the authority section in the URI; e.g., juliahub.com)
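
To make the Data.toml item about the `[datasets]` section a bit more concrete, here's a rough sketch of the two layouts side by side, parsed with the `TOML` standard library. The dataset name, the trimmed-down set of fields, and the `datasets_by_name` helper are all made up for illustration; this is only one way the dictionary form might look, not a settled format.

```julia
using TOML

# Array layout: each dataset is an element of an array of tables and
# carries its own `name` property. (Fields trimmed for illustration.)
array_style = TOML.parse("""
[[datasets]]
name = "a_text_file"
description = "A text file"

[datasets.storage]
driver = "FileSystem"
type = "Blob"
path = "@__DIR__/data/file.txt"
""")

# Dictionary layout (hypothetical): the dataset name becomes the table key.
dict_style = TOML.parse("""
[datasets.a_text_file]
description = "A text file"

[datasets.a_text_file.storage]
driver = "FileSystem"
type = "Blob"
path = "@__DIR__/data/file.txt"
""")

# Hypothetical helper: convert the array layout into the keyed layout.
function datasets_by_name(config)
    Dict(d["name"] => delete!(copy(d), "name") for d in config["datasets"])
end

# Both layouts carry the same information.
@assert datasets_by_name(array_style) == dict_style["datasets"]

# Quoted keys let a table name contain spaces, slashes, and so on.
quoted = TOML.parse("""
[datasets."name with spaces/and-slashes"]
description = "Quoted keys allow arbitrary names"
""")
@assert haskey(quoted["datasets"], "name with spaces/and-slashes")
```

One thing the keyed layout gives for free is that duplicate names become a TOML parse error rather than something the package has to check for itself.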