support multiple storage locations #4396
What does a user story for this look like? As a member of the research community, I recognize that the files comprising datasets can and should exist in different storage locations, which may be accessible by different protocols internally (by the repository) and externally (by users). I would like this information to be present in Dataverse, so that researchers can access (and compute on) the datafiles using the storage location and access protocol most efficient for their needs, and so that repository administrators can make use of external components to orchestrate this.
Questions I have about this:
I realize these questions point to work that is not going to be done in this story, but I'm trying to understand the architecture needed.
If we decide to remove storage identifier (at some point, maybe not this story) and do this through a more verbose storage location table, this is what I see as needed:
@matthew-a-dunlap Good questions; this does need to be clarified a little.
I may not be following all of the storage identifier points - my thinking is that the abstract data structure would be a sparse matrix with sites in one dimension and datasets in the other. This would need another table for storing information about the sites themselves (storage type, transfer protocols, name, etc.). I'll check some of the earlier DLM docs / specs to see if there's anything helpful.
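That sparse site-by-dataset matrix could be sketched as a nested map, where an absent entry means a dataset has no copy at that site. This is only an illustration of the idea; `StorageMatrix` and its method names are hypothetical and not part of Dataverse:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a sparse site-by-dataset matrix as a nested map.
// An absent entry means "this dataset has no copy at that site".
public class StorageMatrix {
    // siteId -> (datasetId -> storage path at that site)
    private final Map<String, Map<String, String>> bySite = new HashMap<>();

    public void put(String siteId, String datasetId, String path) {
        bySite.computeIfAbsent(siteId, s -> new HashMap<>()).put(datasetId, path);
    }

    public Optional<String> lookup(String siteId, String datasetId) {
        return Optional.ofNullable(bySite.getOrDefault(siteId, Map.of()).get(datasetId));
    }
}
```

Because the matrix is sparse, storage for it only grows with the number of (site, dataset) pairs that actually exist, not with the full cross product of sites and datasets.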
I'm taking myself off this story, as I think these prov fixes will take me today and tomorrow (hopefully not more). I'll pick this up after that if it is still open!
We discussed this story a bit during our regular technical discussion meeting.
Some more questions that came up after the talk:
Also, here are some DLM docs. Note they are fairly old but capture the need:
Based on my reading of the issue and a review of the documents, I've come up with a proposed table design and some assumptions about the workflow. I'd like to get some feedback from @akio-sone, @landreev, @pameyer, and anyone else who would like to weigh in.

Support Multiple Storage Locations Table Design: A single file can be stored in multiple locations. In order for Dataverse to keep track of the locations where files are stored, we will need two new tables: the first will represent a storage location, and the second will link a datafile to one or more storage locations.

Storage Location table:

Join table:

Workflow Assumptions: For now, Storage Location records will be created via the API.
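A minimal sketch of the proposed two-table design, written as plain Java records standing in for the rows. Column names here are illustrative assumptions (drawing on `primaryStorage` and `transferProtocols`, which are mentioned later in the thread), not the final schema:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hedged sketch of the proposed two-table design; field names are
// illustrative assumptions, not a final schema.
public class StorageSchema {
    // Storage Location table: one row per storage site.
    record StorageSite(long id, String name, boolean primaryStorage, String transferProtocols) {}

    // Join table: links a DataFile to one or more StorageSites,
    // so a single file can be recorded at multiple locations.
    record DataFileStorageLocation(long dataFileId, long storageSiteId) {}

    // All sites holding a copy of the given file.
    static List<StorageSite> sitesFor(long dataFileId,
                                      List<DataFileStorageLocation> joins,
                                      List<StorageSite> sites) {
        return joins.stream()
                .filter(j -> j.dataFileId() == dataFileId)
                .map(j -> sites.stream()
                        .filter(s -> s.id() == j.storageSiteId())
                        .findFirst().orElseThrow())
                .collect(Collectors.toList());
    }
}
```

The join table is what makes the file-to-location relationship many-to-many: adding a copy of a file at a new site is just one more row, with no change to the file or site records themselves.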
- rename table and add primaryStorage and transferProtocols
- rules around primaryStorage
- various cleanup
@pameyer I just made various improvements and fixes we talked about this morning. Can you please take another look? Thanks!
@pdurbin Looks good to me (and as a bonus, less XHTML on errors). Thanks!
…ons #4396 Conflicts (new methods added in same place on both branches): src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java
@pameyer sure, I just merged the latest from "develop" in the pull request in e776b85. Thanks for moving this to QA. @kcondon in terms of how to test, the changes to the guides are a good starting point. Please note that the
Was not able to complete testing due to: #4605
In the spirit of breaking things into smaller chunks, the first step towards DLM integration would be to extend Dataverse to know about data files that are stored in multiple "storage locations".
related: #3403