RFC: DataSets.jl integration #144
Cool. I have not yet gotten around to finding out about DataSets.jl.
Good question. I'm not sure I understand the design goal of DataSets.jl. I originally thought it was a way to separate data source configuration from the implementation of data source fetching. But then I checked JuliaComputing/DataSets.jl#6, and it seems that DataSets also wants to solve the how-to-open issue; for that part, DataDeps can't help at all.
Very nice, thanks for this proof of concept!
Yes, this is one design goal. The benefit of having DataSets integration is that you can use the DataSets API to process data which happens to be stored as a DataDep. But you can equally well use that same API to process a dataset that is stored in various other ways: for example, inside an Artifact, on local disk, fetched on demand from S3, etc.
This is true, but it's not a problem. IIUC a DataDep always presents data as a directory. That would be reflected into Julia as the
I think either here or in a 3rd package: DataSets itself can't take dependencies on all storage backends, as there's essentially an unlimited number of those. We could bless a few storage backends, but for now I've limited that to the filesystem and data embedded along with the metadata.
Maybe DataSets just plain replaces DataDeps? e.g. an Artifact from BinaryBuilder is good for artifacts;
I'm thinking of adding Downloads to DataSets.jl, and then providing a FileIO-like downloader registry (also in DataSets.jl) to support lazy package loading for the driver packages. Does this sound good to you? If this sounds good, then I now believe we need a new optional entry For downloading support, we may also need to consider some streaming cases via HTTP API.
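The "downloader registry with lazy loading" idea above can be illustrated with a small, language-neutral sketch. It is written in Python only for brevity and is not DataSets.jl or FileIO code: `DriverRegistry`, `load_http_driver`, and the `"http"` name are all hypothetical. The point is only that registration stores a loader, and the backing module is imported on first lookup. It does not model Julia-specific concerns such as world age.

```python
import importlib
from typing import Any, Callable, Dict


class DriverRegistry:
    """Hypothetical registry mapping driver names to lazily loaded drivers."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._drivers: Dict[str, Any] = {}

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        # Store only the loader; nothing heavy is imported at registration time.
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        # Run the loader on first use and reuse the result afterwards.
        if name not in self._drivers:
            if name not in self._loaders:
                raise KeyError(f"no driver registered under {name!r}")
            self._drivers[name] = self._loaders[name]()
        return self._drivers[name]


registry = DriverRegistry()
load_count = 0


def load_http_driver() -> Callable[..., Any]:
    # The import happens here, on first lookup, not when the registry is built.
    global load_count
    load_count += 1
    return importlib.import_module("urllib.request").urlopen


registry.register("http", load_http_driver)
count_before_use = load_count      # loader has not run yet
driver = registry.get("http")      # first lookup triggers the import
driver_again = registry.get("http")  # second lookup reuses the loaded driver
```

In this sketch the registry never needs to depend on the driver packages themselves, which is the property the proposal is after.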
Yes, this is exactly the kind of thing DataSets is good for (also mapping in S3 prefixes as
DataSets already has something like this for drivers :-D See JuliaComputing/DataSets.jl#20 Note that it's very easy to run into world age issues with lazy loading.
From my point of view, the "downloader" is the "driver"; they're not separate. If that seems strange, consider that DataSets doesn't require data to be immutable or local: its model is not "download everything one time and use the local cache after that". It could make sense to have a driver type which has "downloader" as a sub-key and makes some strong assumptions about the API of that downloader. But I suspect we'll need custom shim code for each downloader module, so making this purely declarative might not work (then we're back to "just write a different driver for each downloader"). There's probably some common cache management code we should have if we're going to support multiple remote backends. BTW we've been discussing similar use cases over at JuliaComputing/DataSets.jl#26. I think it would be neat to unify the use cases of RemoteFiles.jl and DataDeps.jl within a common remote downloading interface in DataSets.jl. DataSets.jl can certainly take a dependency on Downloads.jl to make this all work!
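To make the "driver with a downloader sub-key plus shared cache management" idea concrete, here is a minimal sketch under stated assumptions. It is Python rather than Julia, and nothing in it is DataSets.jl, DataDeps.jl, or RemoteFiles.jl API: `CachedRemoteDriver`, `fake_downloader`, and the example URL are hypothetical. The only assumption made about a downloader is that it is a callable taking a URL and returning bytes. A real backend would likely need its own shim to fit that shape.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


class CachedRemoteDriver:
    """Hypothetical driver that wraps any downloader with a local file cache."""

    def __init__(self, downloader: Callable[[str], bytes], cache_dir: Path) -> None:
        self.downloader = downloader
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def open(self, url: str) -> Path:
        # Key the cache entry by a hash of the URL so any URL maps to a safe filename.
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        path = self.cache_dir / key
        if not path.exists():
            # Cache miss: delegate to the backend-specific downloader.
            path.write_bytes(self.downloader(url))
        return path


calls: List[str] = []


def fake_downloader(url: str) -> bytes:
    # Stand-in for a real backend; records each call instead of using the network.
    calls.append(url)
    return b"payload for " + url.encode("utf-8")


cached = CachedRemoteDriver(fake_downloader, Path(tempfile.mkdtemp()))
first = cached.open("https://example.org/data.csv")
second = cached.open("https://example.org/data.csv")  # served from the cache
```

The cache logic is shared while the downloader is swappable, which is one way to read "common cache management code" across multiple remote backends. The sketch deliberately ignores invalidation, which matters if the remote data is mutable.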
DataDeps is quite reliable at downloading content, but managing a dataset registry can be troublesome when dataset information is hardcoded in Julia source code. I think the DataSets.jl TOML specification is quite promising as an eventual general registry for datasets, so maybe we can just integrate DataDeps into DataSets as a downloading driver.
This is just a proof of concept; I want to know your thoughts on this before I start polishing the details.
cc: @c42f