The road to DataSets 1.0 #43

Open

Description

Here's a rough list of items I'm considering on the path to a DataSets 1.0 release. Several of these can and should land before 1.0 so that the APIs have a chance to be adjusted before they're frozen.

  • Streamline access for small datasets by providing a "high level" API for working with a fully in-memory representation of the data, one which doesn't require managing separate resources. ("Separate resources" would be things like an on-disk cache of the data, incremental async download/upload; that kind of thing.) Perhaps we can use the verbs load() / save() for this — thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
  • Somehow allow load() and save() to return some "default type the user cares about" for convenience. For example, returning a DataFrame for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in Data Layers #17
  • Consider the fate of dataset() and open() — currently the open(dataset(...)) idiom is a bit of an awkward double step and leads to some ambiguities. Perhaps we could repurpose dataset(name) to mean what open(dataset(name)) currently does?
  • Perhaps unexport DataSet? Users should rarely need to use this directly.
  • Storage API: finalize how we're going to deal with "resources" which back a lazily downloaded dataset — cache management, etc. We could adopt the approach from ResourceContexts.jl, for example using ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx). Or from ContextManagers.jl in the style ctx = dataset("name"); x = value(ctx); close(ctx). (Both of these have macros for syntactic shortcuts.)
  • Improve and formalize the BlobTree API
  • Figure out how we can integrate with FilePathsBase and whether there's a type which can implement the AbstractPath interface well enough to allow things like CSV.read(x) to work for some x. Perhaps we need a DataSpecification type for the URI-like concept currently called "dataspec" in the codebase? We could have CSV.read(data"foo?version=2#a/b")?
  • Consider deprecating and removing the "data entry point" stuff @datarun and @datafunc. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
  • Fix some issues with Data.toml
    • Consider representing [datasets] section as a dictionary mapping names to configs, not as an array with name properties. This is safe because TOML syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given DataSet is specifically tied to a project.)
    • Move data storage driver type outside of the storage section?
    • Fix up the mess with @__DIR__ templating somehow (fixed in DataSet configuration #46)
  • Dataset resolution
    • Rename DataSets.PROJECT to DataSets.PROJECTS if this is always a StackedDataProject.
    • Consider whether we really want a data stack vs how "data authorities" could perhaps work (i.e. the authority component of the URI, e.g. juliahub.com)
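
To make the first two items a bit more concrete, here's a sketch of what the "high level" load() / save() surface could look like. None of this exists yet — the method signatures and the way the "default type" is chosen are all hypothetical:

```julia
# Hypothetical API sketch — illustrative only, not the current DataSets.jl API.
using DataSets, DataFrames

# load() would return a fully in-memory value, with no separate resources
# (caches, async transfers) for the caller to manage. The sink type argument
# is one way to get the "default type the user cares about":
df = load(DataFrame, dataset("my_tabular_data"))

# save() would write an in-memory value back to the dataset's storage:
save(dataset("my_tabular_data"), df)
```

The sink-type-first argument mirrors the CSV.read(file, DataFrame) convention, which might make the dynamic-module-loading problem from #17 more tractable since the caller supplies the type.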
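
For the storage API item, the two resource-management styles mentioned above would look roughly like this (again hypothetical — neither is currently implemented in DataSets.jl):

```julia
# Style A, after ResourceContexts.jl: resources are tied to an explicit context.
ctx = ResourceContext()
x = dataset(ctx, "name")   # any caches/downloads are registered with ctx
# ... use x ...
close(ctx)                 # releases everything acquired through ctx

# Style B, after ContextManagers.jl: the dataset handle *is* the context.
ctx = dataset("name")
x = value(ctx)             # unwrap the managed value
# ... use x ...
close(ctx)
```

Both styles have macro shortcuts in their respective packages (@! in ResourceContexts.jl, @with in ContextManagers.jl) that would remove the explicit close() calls.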
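
On the FilePathsBase point, the data"..." string macro idea might read like this. Note that DataSpecification and this macro are proposals, not existing code:

```julia
# Hypothetical: data"..." would parse the URI-like "dataspec" syntax into a
# DataSpecification. The query part selects a dataset version, and the
# fragment addresses a path inside the dataset's tree.
spec = data"foo?version=2#a/b"

# If DataSpecification implemented enough of the AbstractPath interface from
# FilePathsBase, generic consumers could accept it directly:
CSV.read(spec, DataFrame)
```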
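
For the Data.toml representation question, the two candidate layouts side by side (field names below are placeholders for whatever config a dataset carries):

```toml
# Current: array of tables, with the name stored as a property
[[datasets]]
name = "foo"
# ... rest of the dataset config ...

# Proposed: dictionary mapping names to configs; TOML quoted keys
# allow arbitrary strings as table names
[datasets."foo"]
# ... rest of the dataset config ...
```

The dictionary form would also make duplicate names a parse-time error rather than something DataSets.jl has to check for.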