
support multiple storage locations #4396

Closed · pameyer opened this issue Jan 3, 2018 · 20 comments

@pameyer (Contributor) commented Jan 3, 2018

In the spirit of breaking things into smaller chunks: the first step towards DLM integration would be to extend Dataverse to know about data files that are stored in multiple "storage locations".

related: #3403

@pameyer (Contributor, Author) commented Jan 10, 2018

What does a user story for this look like?

As a member of the research community, I recognize that the files comprising datasets can and should exist in different storage locations, which may be accessible via different protocols internally (by the repository) and externally (by users). I would like this information to be present in Dataverse so that researchers can access (and compute on) the datafiles using the storage location and access protocol most efficient for their needs, and so that repository administrators can make use of external components to orchestrate this.

@matthew-a-dunlap (Contributor) commented:

Questions I have about this:

  • What level of granularity do we see as needed for saving to storage locations? Do we see users being able to choose different storage locations for different files upon saving? Or will datasets have a storage location attached to them that decides where their files go?
  • Currently, each dvobject has a storage identifier. If we move forward with storage locations, it seems we will likely end up with a table for the different locations, and those locations will likely include the storage type. If so, might we end up removing the storage identifier and rolling that info into locations instead?

I realize these questions point to work that is not going to be done in this story, but I'm trying to understand the architecture needed.

@matthew-a-dunlap (Contributor) commented:

If we decide to remove the storage identifier (at some point, maybe not in this story) and do this through a more verbose storage location table, this is what I see as needed:

  • a new column in dvobject for the storage location
  • a new table for storing the storage locations, including a URL, type of location, name, etc.
  • maybe a system variable to set the default storage location? Maybe this should be per dataverse? I need to learn more about use cases.
  • code to add the storage location to files as they are added to the system
  • code to retroactively add storage locations to existing files based on their storage identifier (I'm a bit unsure about this; see the sketch after this list)
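
A minimal sketch of what that could look like, assuming PostgreSQL (which Dataverse uses); every table and column name here is hypothetical, not actual schema:

```sql
-- Hypothetical sketch of the variant above: each dvobject points at
-- exactly one row in a new storage locations table.
CREATE TABLE storagelocation (
    id   BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL,   -- human-readable label
    url  TEXT NOT NULL,   -- where the location lives
    type TEXT             -- e.g. 'posix', 'swift', 's3'
);

-- The "new column in dvobject" from the list above:
ALTER TABLE dvobject
    ADD COLUMN storagelocation_id BIGINT REFERENCES storagelocation (id);
```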

@pameyer (Contributor, Author) commented Mar 23, 2018

@matthew-a-dunlap Good questions; this does need to be clarified a little.

  • For level of granularity: repository/installation level for storage sites, and dataset level for which dvobjects are stored at which sites.
  • Current thinking is that all data file depositions will go to an internal/primary storage site, so this won't be something a user has to select.

I may not be following all of the storage identifier points; my thinking is that the abstract data structure would be a sparse matrix with sites in one dimension and datasets in the other. This would need another table for storing information about the sites themselves (storage type, transfer protocols, name, etc.).

I'll check some of the earlier DLM docs / specs to see if there's anything helpful.

@matthew-a-dunlap (Contributor) commented:

I'm taking myself off this story as I think these prov fixes will take me today and tomorrow (hopefully not more). I'll pick this up after that if it is still open!

@matthew-a-dunlap matthew-a-dunlap removed their assignment Mar 27, 2018
@matthew-a-dunlap (Contributor) commented Mar 27, 2018

We discussed this story a bit during our regular technical discussion meeting.

  • We need two tables: one that contains all possible storage locations for the installation, and one that joins locations to individual dvobjects.
  • We need to keep track of the primary identifier.
    • This definitely means we need to know the location Dataverse will go to when asked to serve a file by default. This is probably per datafile.
    • We also need to know the storage location to save to when Dataverse saves a file through normal means. We did not discuss whether this would be at a Dataset or Dataverse level.
  • Open question: what do we do with the current storage identifier? Do we keep it in place as primary, change it, or get rid of it?

Some more questions that came up after the talk:

  • Do we need control of where files will be replicated at the file level, or the Dataset/Dataverse level?
  • Should our tables be tracking both sites Dataverse can talk to directly and ones it cannot? If so, we probably need to distinguish the two.
  • We are designing this foremost for DLM, but what other functionality are we looking to enable with this as the foundation?
    • Backups

Also, here are some DLM docs. Note they are fairly old but capture the need:

@sekmiller (Contributor) commented Apr 2, 2018

Based on my reading of the issue and review of the documents, I've come up with a proposed table design and some assumptions about the workflow. I'd like to get some feedback from @akio-sone, @landreev, @pameyer, and anyone else who would like to weigh in:

Support Multiple Storage Locations Table Design:

A single file can be stored in multiple locations. In order for Dataverse to keep track of the locations where files are stored, we will need two new tables: the first will represent a storage location, and the second will link a datafile to one or more storage locations.

Storage Location table:

  • Id (Long): identifier for the storage location
  • Name (String): human-recognizable name for the location, to be displayed by Dataverse (on the File page or Dataset page?)
  • Description
  • URL (String): storage location URL for depositing/retrieving files
  • Type
  • StorageDriver
  • TransferProtocols (String): protocols for depositing/retrieving files
  • PrimaryWriteLocation
  • Additional fields needed?

Join table:

  • StorageLocation_Id
  • DvObject_id (datafile level)
  • StorageLocationAddress: info on where to find the file within the storage location
  • PrimaryLocation (Boolean): true if this is where Dataverse should serve a download from by default
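
To make the proposal concrete, here is a minimal DDL sketch of the two tables, assuming PostgreSQL; the names and types are illustrative only, not the final schema:

```sql
-- Hypothetical sketch of the two proposed tables; names/types illustrative.
CREATE TABLE storagelocation (
    id                   BIGSERIAL PRIMARY KEY,
    name                 TEXT NOT NULL,   -- human-recognizable label
    description          TEXT,
    url                  TEXT NOT NULL,   -- base URL for deposit/retrieval
    type                 TEXT,
    storagedriver        TEXT,
    transferprotocols    TEXT,            -- e.g. 'rsync,posix'
    primarywritelocation BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE datafile_storagelocation (
    storagelocation_id     BIGINT NOT NULL REFERENCES storagelocation (id),
    dvobject_id            BIGINT NOT NULL REFERENCES dvobject (id),
    storagelocationaddress TEXT,     -- where the file lives within the site
    primarylocation        BOOLEAN NOT NULL DEFAULT FALSE,
    PRIMARY KEY (storagelocation_id, dvobject_id)
);
```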

Workflow Assumptions:

  • For now, Storage Location records will be created via the API.
  • Join table records for the primary (default) location will be written by Dataverse. Additional records for outside locations will be written via the API.
  • Current file storage will be represented by one record in the storage location table, plus one record in the join table for each file (migration of records will happen at release time); a rough migration sketch follows.
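
A rough sketch of what that release-time migration could look like, continuing the hypothetical schema above; the literal values and the dtype filter are assumptions, not tested code:

```sql
-- Hypothetical migration: register today's single local storage as a
-- location, then point every existing datafile at it, carrying over the
-- existing storage identifier as the in-site address.
INSERT INTO storagelocation (name, url, type, primarywritelocation)
VALUES ('local', 'file:///usr/local/dvn/data', 'posix', TRUE);

INSERT INTO datafile_storagelocation
    (storagelocation_id, dvobject_id, storagelocationaddress, primarylocation)
SELECT sl.id, o.id, o.storageidentifier, TRUE
FROM dvobject o, storagelocation sl
WHERE o.dtype = 'DataFile' AND sl.name = 'local';
```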

@pameyer pameyer removed their assignment Apr 16, 2018
@pdurbin pdurbin self-assigned this Apr 17, 2018
pdurbin added a commit that referenced this issue Apr 17, 2018
- rename table and add primaryStorage and transferProtocols
- rules around primaryStorage
- various cleanup
@pdurbin (Member) commented Apr 17, 2018

@pameyer I attempted to implement all the changes we talked about (plus some clean up) in 47dde01. Please take a look at pull request #4585 and if you feel like it's ready for QA, I'd say you can go ahead and move it over. If there's anything else you need, please kick it back to me.

@pdurbin pdurbin removed their assignment Apr 17, 2018
@pameyer pameyer self-assigned this Apr 18, 2018
pdurbin added a commit that referenced this issue Apr 18, 2018
@pameyer pameyer removed their assignment Apr 18, 2018
@pdurbin (Member) commented Apr 18, 2018

@pameyer I just made various improvements and fixes we talked about this morning. Can you please take another look? Thanks!

@pameyer (Contributor, Author) commented Apr 19, 2018

@pdurbin Looks good to me (and as a bonus, less XHTML on errors). Thanks!

@pameyer pameyer removed their assignment Apr 19, 2018
pdurbin added a commit that referenced this issue Apr 19, 2018
…ons #4396

Conflicts (new methods added in same place on both branches):
src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java
@pdurbin (Member) commented Apr 19, 2018

@pameyer sure, I just merged the latest from "develop" in the pull request in e776b85. Thanks for moving this to QA.

@kcondon in terms of how to test, the changes to the guides are a good starting point. Please note that the :ReplicationSites database setting has gone away (it was added in #3998, and there are screenshots in that issue of rsync instructions). We are now using a proper database table called storagesite to store these entries, populating it with our usual pattern of sending JSON to an API endpoint with curl; a sketch of that is below. We've implemented some rules, especially around the primaryLocation boolean: there can only be one primary location. The storage sites are ordered by id in the UI. I'm happy to write more in the guides if you want me to.
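
As an illustration of that pattern (the hostname and field values below are made-up examples, and the endpoint and field names are my reading of the guide changes in pull request #4585, so double-check against the guides):

```bash
cat > storagesite.json <<'EOF'
{
  "hostname": "dv.example.edu",
  "name": "Example Site",
  "primaryStorage": true,
  "transferProtocols": "rsync"
}
EOF

# Add a storage site (only one site may have primaryStorage=true):
curl -X POST -H "Content-type: application/json" \
  http://localhost:8080/api/admin/storageSites --upload-file storagesite.json

# List the configured storage sites:
curl http://localhost:8080/api/admin/storageSites
```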

@kcondon (Contributor) commented Apr 20, 2018

Was not able to complete testing due to #4605.
