[WIP] Adds data stores and references #37
Conversation
| @@ -1,5 +1,6 @@ | |||
| pandas | |||
| six | |||
| retry | |||
There was a problem hiding this comment.
Looks like this hasn't been updated in 3 years. Is this something that we really need as a dependency? If so, is there a better maintained alternative.
There was a problem hiding this comment.
I could change it over to tenacity. I think I left this here by accident while I had GCS code in my dev, so might not need it in this repo at all.
There was a problem hiding this comment.
Cool. Let's double check and remove if possible.
|
|
||
| for manager in reversed(list(self.values())): | ||
| # Only check for managers here that can both encode and store | ||
| if not (manager.encoder_name and manager.store_name): |
There was a problem hiding this comment.
I found this function a little hard to read. I feel like it would be easier to read if this if-statement was inverted and the logic inside the try statement was placed inside it. Something like:
if manager.encoder_name and manager.store_name:
blahAssuming I understand that the intent of this function is to run through the list of valid managers and retrieve the ones that work with the scrap passed as a parameter.
There was a problem hiding this comment.
Good point. I'll take a close look at this section next pass
| if scrap.encoder is not None: | ||
| return self.get((scrap.encoder, None)) | ||
|
|
||
| for encoder in reversed(list(self.values())): |
There was a problem hiding this comment.
Curious: is there a reason you reverse the list of encoders here and elsewhere? Does it have something to do with the properties of an OrderedDict?
There was a problem hiding this comment.
Yes. You can't prefix to OrderedDicts easily, which we need to be able to do, it requires rebuilding the whole structure.
|
Found a few issues in the PR implementation when I tried to get the conflicts resolved. I'm going to close break this into some smaller PRs instead. |
This is still a WIP, more testing and internal API contracts need to be evaluated
In the glue api there is now a
storeanddata_pathoptional arguments. These allow for specifying the path one wishes to use in saving data, as well as the storage mechanism for that data (default "notebook"). In most cases to support remote saves the caller simply has to adddata_path="s3://mybucket/my/path/to/data.jsonto the glue call to save data to that remote path. When recalling the named data scrap the library will automatically know how to fetch and translate the data saved at the referenced path.Internally, In place of just specifying an encoder, there's now an encoder_name and a store_name ("json", "notebook") associated with data managers. Data managers can implement encoding or storing, or both capabilities. One can implement ("text", None) and (None, "s3") independently. This lets simple encoders return string or bytes that the stores can save while allowing for more complex store and recall mechanism to encode and store at the same time (e.g. dataframe to multi-file s3 parquet via ("arrow", "s3")).
To facilitate the contract changes there is now a scrapbook v2 schema. Loading v1 or v2 data is transparent to the library user.