Data "recipes" #337
Ruminating on this a bit more, I'm thinking that a new functional data management + content-addressed storage package, which Dr. Watson could itself use, may be the best idea.
@sebastianpech If I understood this issue correctly, it seems to be achievable with your Metadata approach, right?
@tecosaur please see https://github.com/sebastianpech/DrWatsonSim.jl and let us know what you think. We plan to integrate this into DrWatson very soon.
Just taking a quick peek, it looks like a partial solution, but I don't think it goes nearly far enough.
Having thought a bit more, looked at some data-management related tools, and talked to a friend who also looked at Dr. Watson and didn't think it would quite work for them either, I think I have a clearer idea of what could be a beneficial direction to pursue. It's too late in my timezone for me to go into detail; I'm just putting this here as a reminder for now. Short version: more work for smooth data creation, management, and processing, but I think the results could be worthwhile. Content-addressed storage could be good, as well as (optional) integration with git-annex/datalad (which themselves use CAS).
I don't think the DrWatsonSim.jl functionalities are related to this problem. It would rather be an extension to support remote files in DrWatson and allow tracking of versions and changes of those files.
Okay, but the description of #337 (comment) seems much more like something for CaosDB or some other multi-project database management system (DrWatson's functionality is all for single projects).
Why...? So far it has been wonderful that everything in DrWatson works flawlessly with simple function calls. Needing a different REPL mode is like using command-line tools in my eyes, which is definitely less flexible than calling functions.
That would, however, break the unique identification of a project via a single Project.toml file, and hence break reproducibility.
Ok, it's now a reasonable hour for me, so time to actually explain my thoughts in my earlier comment.

Motivation

I get the impression from the existence of this repo under JuliaDynamics that most usage revolves around running simulations. By contrast, my work is basically entirely data processing and analysis. As such, my main concerns are along the lines of smooth data creation, management, and processing.
My musings are on how this could be enabled in DrWatson, in a fairly easy and effective manner.

The experience I'm envisioning

Data

One would specify data sources (e.g. files on the computer, files downloaded from a URL, etc.). Datasets would be constructed from data sources and other datasets, preferably using pure functions. Data sources and datasets would be recorded in a manifest file of sorts, providing a record of the state and relation of each data source/set. For example:

```toml
[[boston_housing_data_csv]]
type = "URL"
location = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
version = "0.0.0"
cache = true
uuid = "49547965-d3ad-d314-7841-a2057a38ab48"
hash = 0x31b206cb9eecfef4
[[boston_housing_data]]
type = "dataset"
inputs = ["boston_housing_data_csv"]
version = "0.0.0"
cache = true
uuid = "80eca2fa-a33a-df0e-1e4d-430b1fa3b0eb"
hash = 0xb8d0227e6ca2467a
```

For both data sources and datasets, the resulting files can simply be saved/cached using their hash as the filename (i.e. a form of content-addressed storage). This seems like it would be quite straightforward and robust to me. It would probably also be worth having a method that lists/removes all 'orphan' files (i.e. files whose name does not correspond to a data source/set in the manifest).

For version control, large/binary files (such as many data sources/sets) can be a bit problematic. With this method, the manifest + dataset construction functions should be enough to reproducibly construct datasets. The other approach is to 'simply' check in your data files too, but this has issues.
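For illustration, here is a minimal sketch of this kind of content-addressed cache. All names here (the cache directory, `cache_store`, `cache_fetch`, `orphans`) are hypothetical, not an actual DrWatson API:

```julia
using SHA  # Julia's SHA standard library

# Hypothetical sketch: store blobs under data/cache/<content hash>, so that
# identical content is only ever saved once and filenames never go stale.
const CACHE_DIR = joinpath("data", "cache")

function cache_store(bytes::Vector{UInt8})
    mkpath(CACHE_DIR)
    key = bytes2hex(sha256(bytes))      # the filename is the content hash
    path = joinpath(CACHE_DIR, key)
    isfile(path) || write(path, bytes)
    return key
end

cache_fetch(key::AbstractString) = read(joinpath(CACHE_DIR, key))

# 'Orphan' files: cached blobs whose name matches no hash in the manifest.
orphans(manifest_hashes) = filter(!in(manifest_hashes), readdir(CACHE_DIR))
```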
DrWatson could, however, make use of git-annex/datalad if they are already installed/used, to allow for a bit of a "best of both worlds" solution. Once a data source is acquired or a dataset constructed, it could be checked in using git-annex, and then when requesting a data source/set that isn't saved/cached locally we could first check whether git-annex knows of an available copy. I can see this being particularly useful for enabling workflows with some large/expensive-to-process datasets, where one needs to do the processing of a particular dataset on a different computer (e.g. a server with hundreds of threads and a terabyte of memory) and then fetch the result.

For working with this, I'd imagine a module with functions a bit like:

```julia
DrWatson.Data.addsource("boston_housing_data_csv", "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"; kwargs...)
DrWatson.Data.createdataset("NAME")
```
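To make that concrete, here is a rough sketch of what an `addsource` of this kind might do with the manifest format above. The function name, manifest filename, and behaviour are all assumptions for illustration, not an actual DrWatson API:

```julia
using TOML, UUIDs  # both Julia standard libraries

# Hypothetical sketch: fetch the source once, then record it in the manifest
# with a fresh UUID and a hash of its contents. repr(hash(...)) yields the
# 0x... style shown in the manifest example above.
function addsource(name::String, url::String; manifest="DataManifest.toml")
    entries = isfile(manifest) ? TOML.parsefile(manifest) : Dict{String,Any}()
    data = read(download(url))
    entries[name] = [Dict(
        "type"     => "URL",
        "location" => url,
        "version"  => "0.0.0",
        "cache"    => true,
        "uuid"     => string(uuid4()),
        "hash"     => repr(hash(data)),
    )]
    open(io -> TOML.print(io, entries), manifest, "w")
    return entries[name]
end
```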
Much like Pkg, these operations could also be exposed through a dedicated REPL mode. Is just this worth a REPL mode? I don't think so by itself, but I have more thoughts on where this could be similarly handy (see later).

Analysis

I said I mainly do data analysis, didn't I? Now we've dealt with the first half of this, let's move on to the second. It's worth noting that while I feel my thoughts on what a good approach for data would be are settling, my thoughts on analysis are still nascent.

I think it would be worth having some sort of register of analysis methods, and a way of asking DrWatson to apply a particular method to a particular dataset. This way DrWatson can record the hash of the dataset and the version of the analysis method when generating the results, and thus be able to determine when results are out of date (see the sketch below). How analysis methods should be handled by themselves is something I'm currently unsure of (file per method, hash the file? I'm not sure). However, I think application could look something like this:

```julia
DrWatson.Data.analysis("METHOD")(DrWatson.Data.get("DATASET"))
```

Once again, I think a shorthand in the REPL could be nice.
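Returning to the out-of-date detection mentioned above: in sketch form it could be as simple as comparing a recorded dataset hash and method version. All names here are assumptions for illustration:

```julia
# Hypothetical sketch: a result is stale when either the dataset contents or
# the analysis method's recorded version has changed since it was generated.
struct ResultRecord
    dataset_hash::UInt64
    method_version::VersionNumber
end

is_stale(rec::ResultRecord, dataset, version::VersionNumber) =
    rec.dataset_hash != hash(dataset) || rec.method_version != version

# Re-run the analysis only when needed, updating the record as we go.
function result!(records::Dict{String,ResultRecord}, key, method, dataset, version)
    if !haskey(records, key) || is_stale(records[key], dataset, version)
        records[key] = ResultRecord(hash(dataset), version)
        return method(dataset)
    end
    return nothing  # in practice, load the cached result from disk instead
end
```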
A REPL mode also allows for conveniences such as tab completion, which I know I'd appreciate.

Interoperability

It's not inconceivable that one may want/need to use some data/result with another tool, so it would be nice to have an easy way to ask DrWatson to produce the data/result file if it doesn't exist. This is another half-formed idea, but I think something like a convenient way of materialising a requested data/result file on demand could work.

Reuse

When trying to improve an existing method, it can be useful to compare the new method to the old. One way of doing this is just by copying the code, but this feels like a road to "oops, I missed a function defined in another file", "oops, I needed an auxiliary data set", etc. As such, I think it would be rather nice if DrWatson provided a way to say "hey, there's this other DrWatson project at PATH with UID (autodetected), and I'd like to get the result of running its METHOD on my DATA". Once again, I'd imagine the use of this would be via a few functions, but I'd also think there could be a convenient REPL experience.
That's it for now. Let me know what your thoughts are!
@Datseris @sebastianpech if either of you have thoughts on the system I've outlined above, I'd love to hear them.
(super busy with other tasks, won't have time to reply soon. but will reply at some point!) |
Thanks, I appreciate you letting me know this hasn't slipped by 🙂. I feel like there could be a few ideas of value here and would really like to thrash them out to try to make the most of them.
I think the Kedro data catalog could be worth a look here: https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html
I've just stumbled across https://github.com/JuliaComputing/DataSets.jl. Perhaps @c42f might be interested in this discussion? (Chris, see this comment first.)
@tecosaur this discussion is extremely relevant to work I've been doing in DataSets.jl. In many ways, I've already implemented these things, or at least thought about them and have some kind of plan :-)

Your TOML file describing datasets declaratively is extremely similar in spirit (and somewhat in content!) to what already exists in DataSets.jl itself.

My thoughts about versioning are that we need various pluggable data storage backends, and that the data access API should support but not require versioning (see the sketch below). After all, versioning is not essential for things like ephemeral data caches, and outright not supported by many important data sources. For example, downloading data from an FTP server, downloading a table from a transactional database, etc. We should be able to represent such datasets despite their lack of versioning! However, when you want reproducibility you really do want versioning. In that case you should be able to opt into a data storage backend (like datalad, gin, dvc, or just plain git) which supports versioning in a first-class way.

For a REPL, DataSets.jl already has a prototype data REPL. Regarding data loading, I've done some prototyping of what I call declarative "data layers" in JuliaComputing/DataSets.jl#17, and I think this is similar to your loader/preprocessor idea.

Regarding analysis, I think this is somewhat out of scope for DataSets.jl — I feel analysis is something which is best tackled by a (DAG-based?) workflow engine of some kind. But you could use such a workflow engine alongside DataSets.jl, and it would provide a way to store any necessary metadata.

It might help to watch my very brief DataSets talk from JuliaCon last year to get another take on what it's all about.
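To illustrate the versioning point above in code, here is a small sketch of a pluggable-backend design where version resolution is an optional capability. This is purely illustrative (all type and function names are made up), not DataSets.jl's actual interface:

```julia
# Sketch: storage backends plug in via a common abstract type; versioning
# is an optional capability that consumers query rather than assume.
abstract type StorageBackend end

supports_versioning(::StorageBackend) = false

struct FTPSource <: StorageBackend   # unversioned: can only give the latest copy
    url::String
end
fetchdata(src::FTPSource) = read(download(src.url))

struct GitSource <: StorageBackend   # versioned: any revision on request
    repo::String
    path::String
end
supports_versioning(::GitSource) = true
fetchdata(src::GitSource; rev::AbstractString="HEAD") =
    read(`git -C $(src.repo) show $(rev):$(src.path)`)  # illustrative only
```

A consumer that needs reproducibility can then check `supports_versioning(backend)` and refuse (or warn) when pointed at an unversioned source.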
Ok, this is sounding quite promising, I think. I'll have a look at that JuliaCon talk, but I get the impression that DataSets.jl's goals align with my own, while the particular use case I have in mind isn't quite covered yet. If you see the future of DataSets.jl as something which encompasses my use case (as a sort of foundational package that is versatile enough to be well applied to many areas), perhaps it could be productive for us to have a chat at some point. Would you be up for that?
Yes, I'd be happy to chat. Can organize via the JuliaLang Slack or Zulip if you're there? Also, it would be useful if you open an issue on DataSets.jl to describe how you'd like to use it. Currently I'm working on new features (particularly improving the ability to programmatically create new datasets and write output data), so I have some time available.
That sounds good to me. I'll get in touch over Zulip.
I think DataSets.jl looks very similar to the Kedro data catalog https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html, which is a great thing. Just in case you were looking for more inspiration, @c42f.
Agreed. Whether it's worth it to write Yet Another workflow engine in Julia is the question.
@emiller88 Thanks for the mention of kedro, I'll have to read through that. Actually, I've never seen a library like DataSets.jl before. In many ways that means it's been a hard design project!
Okay, I've watched the JuliaCon talk and gone through the docs of DataSets.jl. This is pretty much exactly what I wanted! It's very similar to Kedro's data catalog. I'm super excited, because this solves my need, but I think this is just a piece of the bigger vision @tecosaur has.
@Datseris since my initial comment, this has snowballed and become a package I'll be presenting at JuliaCon. I'm currently in a polish/docs/tests stage of initial development; feel free to poke me on Slack/Zulip if you're interested in hearing more 🙂.
What's the repo link?
There are a few repos at this point, in varying states of documentation/polish/testing.
I'd recommend looking at https://tecosaur.github.io/DataToolkit.jl/dev/ for a high-level view of the project.
Cross-reference: DrWatson has been mentioned/asked about in the Discourse announcement of DataToolkit https://discourse.julialang.org/t/ann-datatoolkit-jl-reproducible-felexible-and-convenient-data-management/104757/2 |
Thanks for cross-referencing!!! Weird, I somehow didn't get an email notification of that post. I will read this when I have some time on my hands, and go through the JuliaCon talk as well. In the meantime, @tecosaur you said that this should be used in a DrWatson project (I totally agree) and that it could be integrated more directly. If you have ideas on that, would you mind opening a new issue exposing them? (This issue is already quite lengthy, and I think it is more useful to have a targeted discussion in a new issue.)
That's great to hear 😀. After spending so long putting this together, I'm quite keen to "get it out there" and hope to see it start actually helping people the way I had in mind when designing/writing DataToolkit.
So, to my shame, I have actually yet to use DrWatson for anything non-trivial 😔. As such, I'm probably not the best person to say how it could best be integrated. I suspect just using it separately (i.e. without any particular integration) would probably be a decent experience to start with. A direction along the lines of #255 might make sense? Oh, BTW this should also help with #186, and be particularly good with this concern:
Since DataToolkit's
Oh come on, it hasn't even hit triple digits yet 😛 /s

More seriously, that does seem pretty reasonable to me. It's also funny (to me) to think that the first few comments here show the genesis of DataToolkit. A few long comments, and then a year later there are four packages, ~14k LOC, a huge number of hours, and a JuliaCon talk 😄.
Hello!
A little disclaimer to start with: I've only recently come across this project, and I'm trying to see if it can work for my needs, so please let me know if I've missed something obvious.
That said, I've been going through the documentation to see if Dr. Watson could help give a bit more structure and order to some work I'm currently undertaking, and I can't help but feel it just falls short at the moment.
Background
I'm currently doing a lot of work with some data sources, manipulating and combining them etc. To facilitate this, I've set up what I term "artefact recipes".
I'm hoping that by describing how they work, you may be able to tell me how this could be achieved with Dr. Watson, or inspire the addition of equivalent functionality.
How my "recipes" work
I have a global `Dict` of recipes, each named with a `Symbol`, and their definition.

Each recipe definition has a number of components, including:
- a loader: this function should load the data, and return a Julia object
- a cachefile: if `nothing`, then no cache is used and the loader is re-run every time this artefact is requested in a Julia session (useful if the result is something that won't cache that well, like a function)

There are also two generated components of a recipe:
- a hash of the recipe's functions (obtained via `code_lowered` → `string`, which is a bit dodgy but mostly works)

Using an artefact definition, as described above, I can simply call `get_artefact(:name)` and the artefact is loaded, from cache when possible.

Artefact dependencies in the `loader`/`preprocessor` are automatically registered: as `get_artefact` is called, it registers the artefact loaded. The hash of each dependency is stored in the cachefile, and checked when the cachefile is loaded.
This is a bit long, but it should give a decent outline of the mechanism. For what it's worth, the code required is only ~200 lines, and that includes things like a 30-line pretty download function, etc.
Some examples

- Getting a CSV from a URL
- Getting a gzip'd CSV from a URL, unzipping it, and loading a modified version
- Getting a large gzip'd TSV from a URL, preprocessing it with streaming decompression, and providing a function that efficiently accesses the processed data
- Data derived from other artifacts
When loading this in the REPL, what I see is a demo of loading an artifact with an out-of-date dependency: the stale dependency is detected via its stored hash.
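To give a concrete flavour of the first of these examples, a recipe fetching a CSV from a URL might look like the following, building on the sketch above (CSV.jl and DataFrames.jl assumed; names are illustrative):

```julia
using CSV, DataFrames

# Hypothetical recipe: download the Boston housing CSV and parse it into a DataFrame.
RECIPES[:boston_housing] = () ->
    CSV.read(download("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"),
             DataFrame)

df = get_artefact(:boston_housing)  # downloaded and parsed on first call, cached thereafter
```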
Closing comments
If such a feature doesn't currently exist in Dr. Watson, I think something like this would be a worthwhile addition as it allows for easy, effective, and reproducible data processing.
Thanks for reading this much-longer-than-I-thought-this-would-be issue! Please let me know if this can already be done, and what your thoughts are.