The road to DataSets 1.0 #43

Open

Description

Here's a rough list of items I'm considering on the path to a DataSets 1.0 release. Several of these can and should land before 1.0 so that the APIs have a chance to be adjusted before they're frozen.

  • Streamline access for small datasets by providing a "high level" API for working with a fully in-memory representation of the data, one which doesn't require managing separate resources. ("Separate resources" would be things like an on-disk cache of the data, incremental async download/upload; that kind of thing.) Perhaps we can use the verbs load() / save() for this — thinking of DataSets.jl as a new FileIO.jl, I think this would make sense. (Actually, this isn't breaking, so it doesn't need to wait for 1.0.)
  • Somehow allow load() and save() to return some "default type the user cares about" for convenience. For example, returning a DataFrame for a tabular dataset. This will require addressing the problems of dynamically loading Julia modules that were partially faced in Data Layers #17
  • Consider the fate of dataset() and open() — currently the open(dataset(...)) idiom is a bit of an awkward double step and leads to some ambiguities. Perhaps we could repurpose dataset(name) to mean what open(dataset(name)) currently does?
  • Perhaps unexport DataSet? Users should rarely need to use this directly.
  • Storage API: finalize how we're going to deal with "resources" which back a lazily downloaded dataset — cache management, etc. We could adopt the approach from ResourceContexts.jl, for example using ctx = ResourceContext(); x = dataset(ctx, "name"); ...; close(ctx). Or from ContextManagers.jl in the style ctx = dataset("name"); x = value(ctx); close(ctx). (Both of these have macros for syntactic shortcuts.)
  • Improve and formalize the BlobTree API
  • Figure out how we can integrate with FilePathsBase and whether there's a type which can implement the AbstractPath interface well enough to allow things like CSV.read(x) to work for some x. Perhaps we need a DataSpecification type for the URI-like concept currently called "dataspec" in the codebase? We could have CSV.read(data"foo?version=2#a/b")?
  • Consider deprecating and removing the "data entry point" stuff @datarun and @datafunc. I feel introducing these was premature and the semantics are probably not quite right. We can bring something similar back in the future if it seems like a good idea.
  • Fix some issues with Data.toml
    • Consider representing [datasets] section as a dictionary mapping names to configs, not as an array with name properties. This is safe because TOML syntax does allow arbitrary strings as section names. (Note that either representation is valid when a given DataSet is specifically tied to a project.)
    • Move data storage driver type outside of the storage section?
    • Fix up the mess with @__DIR__ templating somehow (fixed in DataSet configuration #46)
  • Dataset resolution
    • Rename DataSets.PROJECT to DataSets.PROJECTS if this is always a StackedDataProject.
    • Consider whether we really want a data stack vs how "data authorities" could perhaps work (i.e. the authority component of the URI, e.g. juliahub.com)
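
To make the first two items a bit more concrete, here's a sketch of what the "high level" load() / save() surface could look like. None of this exists yet — the method signatures and the way the "default type" is chosen are all hypothetical:

```julia
# Hypothetical API sketch — illustrative only, not the current DataSets.jl API.
using DataSets, DataFrames

# load() would return a fully in-memory value, with no separate resources
# (caches, async transfers) for the caller to manage. The sink type argument
# is one way to get the "default type the user cares about":
df = load(DataFrame, dataset("my_tabular_data"))

# save() would write an in-memory value back to the dataset's storage:
save(dataset("my_tabular_data"), df)
```

The sink-type-first argument mirrors the CSV.read(file, DataFrame) convention, which might make the dynamic-module-loading problem from #17 more tractable since the caller supplies the type.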
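
For the storage API item, the two resource-management styles mentioned above would look roughly like this (again hypothetical — neither is currently implemented in DataSets.jl):

```julia
# Style A, after ResourceContexts.jl: resources are tied to an explicit context.
ctx = ResourceContext()
x = dataset(ctx, "name")   # any caches/downloads are registered with ctx
# ... use x ...
close(ctx)                 # releases everything acquired through ctx

# Style B, after ContextManagers.jl: the dataset handle *is* the context.
ctx = dataset("name")
x = value(ctx)             # unwrap the managed value
# ... use x ...
close(ctx)
```

Both styles have macro shortcuts in their respective packages (@! in ResourceContexts.jl, @with in ContextManagers.jl) that would remove the explicit close() calls.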
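
On the FilePathsBase point, the data"..." string macro idea might read like this. Note that DataSpecification and this macro are proposals, not existing code:

```julia
# Hypothetical: data"..." would parse the URI-like "dataspec" syntax into a
# DataSpecification. The query part selects a dataset version, and the
# fragment addresses a path inside the dataset's tree.
spec = data"foo?version=2#a/b"

# If DataSpecification implemented enough of the AbstractPath interface from
# FilePathsBase, generic consumers could accept it directly:
CSV.read(spec, DataFrame)
```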
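
For the Data.toml representation question, the two candidate layouts side by side (field names below are placeholders for whatever config a dataset carries):

```toml
# Current: array of tables, with the name stored as a property
[[datasets]]
name = "foo"
# ... rest of the dataset config ...

# Proposed: dictionary mapping names to configs; TOML quoted keys
# allow arbitrary strings as table names
[datasets."foo"]
# ... rest of the dataset config ...
```

The dictionary form would also make duplicate names a parse-time error rather than something DataSets.jl has to check for.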