One issue I see with implementing create/update/delete operations (#31, #38) here in DataSets is that different data repositories may have very different ideas of how to execute them, and may require repository-specific information.
A case in point: the TOML-based data repos generally just link to data. Should deletion delete the linked file, or just the metadata? If you create a new dataset, where will the file be? Do you need to pass some options?
One design goal of DataSets is that it provides a universal, relocatable interface. So if you create datasets in a script, that should work consistently, even if you move to a different repository. But if you have to pass repository-specific options, that breaks that principle.
To provide create/update/delete functionality in a generic way, we could introduce the notion of managed datasets: the data repository fully owns and controls the storage. When you create a dataset, you essentially hand it over to the repository, and as the user you cannot exercise any further control over the storage in your script.
For remote, managed storage of datasets, this is how it must work by definition. But we should also have this for the local Data.toml-based repositories. I imagine that your repository would manage a directory somewhere where the data actually gets stored, e.g.:
```
my-data-project/Project.toml
               /Data.toml
               /.datasets/<uuid1-for-File>
               /.datasets/<uuid2-for-FileTree>/foo.csv
               /.datasets/<uuid2-for-FileTree>/bar.csv
```
Now, if you create a dataset in a local project from a file with something like

```julia
DataSets.create("new-ds-name", "local/file.csv")
```

it will generate a UUID for the dataset and simply copy the file to `.datasets/<uuid>`. This way we also avoid problems with, e.g., trying to infer destination file names and running into conflicts.
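The copy-to-UUID mechanics described above can be sketched as follows (in Python for illustration; `create_managed_dataset` is a hypothetical helper, not the DataSets API, and the Data.toml bookkeeping is left as a comment):

```python
import shutil
import uuid
from pathlib import Path

def create_managed_dataset(repo_root, name, source_file):
    """Copy a file into the repository-managed .datasets/ area.

    The repository assigns a fresh UUID and owns the copy, so the
    caller never chooses a destination path and name conflicts
    cannot arise. Returns the new dataset's UUID.
    """
    ds_uuid = str(uuid.uuid4())
    datasets_dir = Path(repo_root) / ".datasets"
    datasets_dir.mkdir(parents=True, exist_ok=True)
    # For a File-type dataset the blob is stored directly at
    # .datasets/<uuid>; a FileTree would instead get a directory.
    shutil.copy2(source_file, datasets_dir / ds_uuid)
    # A real implementation would also record (name, uuid) in Data.toml.
    return ds_uuid
```

Because the destination is derived purely from the UUID, deleting the dataset is equally unambiguous: the repository removes `.datasets/<uuid>` along with the metadata entry.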
A few closing thoughts:
- A data repo might not support managed datasets at all. That's fine; you just can't create/update/delete datasets then, only read existing ones. A repo may also have some datasets that are unmanaged, even if it otherwise supports managed ones.
- All "linked" datasets in a TOML file would be unmanaged, and hence read-only. It would even be worth implementing them via a separate storage driver, in order not to conflate them with the implementation for standard datasets. I'm not sure about an API for creating such a dataset -- it would probably have to be specific to a data repo, because such a dataset only makes sense for some repositories.
- You might be able to convert linked datasets into managed ones though, which would copy the data to the repository's storage (whatever that may be).
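To make the managed/linked distinction above concrete, the Data.toml entries might look roughly like this (a sketch only: the key names follow the general shape of Data.toml, but the exact schema, driver names, and UUIDs here are illustrative, not the actual DataSets format):

```toml
# A managed dataset: the blob lives under .datasets/<uuid>,
# fully owned by the repository.
[[datasets]]
name = "new-ds-name"
uuid = "d97c1e9e-7f4a-4f0d-9a3b-2c5e8f1a6b90"

    [datasets.storage]
    driver = "FileSystem"
    type = "File"
    path = "@__DIR__/.datasets/d97c1e9e-7f4a-4f0d-9a3b-2c5e8f1a6b90"

# A linked dataset: points at a file outside the managed area,
# hence unmanaged and read-only (possibly served by a separate
# linked-file driver, as suggested above).
[[datasets]]
name = "linked-ds"
uuid = "5f2d3c44-9e1b-4a7c-8d6e-0a1b2c3d4e5f"

    [datasets.storage]
    driver = "FileSystem"
    type = "File"
    path = "/home/user/external/data.csv"
```

Converting a linked dataset to a managed one would then amount to copying the target of `path` into `.datasets/<uuid>` and rewriting the storage entry.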