
support multiple storage locations #4396

Closed · pameyer opened this issue Jan 3, 2018 · 20 comments

@pameyer (Contributor) commented Jan 3, 2018

In the spirit of breaking things into smaller chunks: the first step towards DLM integration would be to extend Dataverse to know about data files that are stored in multiple "storage locations".

related: #3403

@pameyer (Contributor, Author) commented Jan 10, 2018

What does a user story for this look like?

As a member of the research community, I recognize that the files comprising datasets can and should exist in different storage locations, which may be accessible via different protocols internally (by the repository) and externally (by users). I would like this information to be present in Dataverse so that researchers can access (and compute on) the datafiles using the storage location and access protocol most efficient for their needs, and so that repository administrators can make use of external components to orchestrate this.

@matthew-a-dunlap (Contributor) commented:

Questions I have about this:

  • What level of granularity do we see as needed for saving to storage locations? Do we see users being able to choose different storage locations for different files upon saving? Or will datasets have a storage location attached to them that decides where their files go?
  • Currently, each dvobject has a storage identifier. If we move forward with storage locations, it seems we will likely end up with a table for the different locations, and those locations will likely include the storage type. If so, might we end up removing the storage identifier and rolling that info into locations instead?

I realize these questions point to work that is not going to be done in this story, but I'm trying to understand the architecture needed.

@matthew-a-dunlap (Contributor) commented:

If we decide to remove the storage identifier (at some point, maybe not in this story) and do this through a more verbose storage location table, this is what I see as needed:

  • a new column in dvobject for the storage location
  • a new table for storing the storage locations, including a URL, type of location, name, etc.
  • maybe a system variable to set the default storage location? Maybe this should be per dataverse? I need to learn more about use cases.
  • code to add the storage location to files as they are added to the system
  • code to retroactively add storage locations to existing files based on their storage identifier (I'm a bit unsure about this; see the sketch after this list)
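
A minimal sketch of what that could look like, assuming PostgreSQL (which Dataverse uses); every table and column name here is hypothetical, not actual schema:

```sql
-- Hypothetical sketch of the variant above: each dvobject points at
-- exactly one row in a new storage locations table.
CREATE TABLE storagelocation (
    id   BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL,   -- human-readable label
    url  TEXT NOT NULL,   -- where the location lives
    type TEXT             -- e.g. 'posix', 'swift', 's3'
);

-- The "new column in dvobject" from the list above:
ALTER TABLE dvobject
    ADD COLUMN storagelocation_id BIGINT REFERENCES storagelocation (id);
```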

@pameyer (Contributor, Author) commented Mar 23, 2018

@matthew-a-dunlap Good questions; this does need to be clarified a little.

  • For level of granularity: repository/installation level for storage sites, and dataset level for which dvobjects are stored at which sites.
  • Current thinking is that all data file depositions will go to an internal/primary storage site, so this won't be something a user has to select.

I may not be following all of the storage identifier points; my thinking is that the abstract data structure would be a sparse matrix with sites in one dimension and datasets in the other. This would need another table for storing information about the sites themselves (storage type, transfer protocols, name, etc.).

I'll check some of the earlier DLM docs / specs to see if there's anything helpful.

@matthew-a-dunlap (Contributor) commented:

I'm taking myself off this story as I think these prov fixes will take me today and tomorrow (hopefully not more). I'll pick this up after that if it is still open!

@matthew-a-dunlap matthew-a-dunlap removed their assignment Mar 27, 2018
@matthew-a-dunlap (Contributor) commented Mar 27, 2018

We discussed this story a bit during our regular technical discussion meeting.

  • We need two tables: one that contains all possible storage locations for the installation, and one that joins locations to individual dvobjects.
  • We need to keep track of the primary identifier.
    • This definitely means we need to know the location Dataverse will go to when asked to serve a file by default. This is probably per datafile.
    • We also need to know the storage location to save to when Dataverse saves a file through normal means. We did not discuss whether this would be at a Dataset or Dataverse level.
  • Open question: what do we do with the current storage identifier? Do we keep it in place as primary, change it, or get rid of it?

Some more questions that came up after the talk:

  • Do we need control of where files will be replicated at the file level, or the Dataset/Dataverse level?
  • Should our tables be tracking both sites Dataverse can talk to directly and ones it cannot? If so, we probably need to distinguish the two.
  • We are designing this foremost for DLM, but what other functionality are we looking to enable with this as the foundation?
    • Backups

Also, here are some DLM docs. Note they are fairly old but capture the need:

@sekmiller (Contributor) commented Apr 2, 2018

Based on my reading of the issue and review of the documents, I've come up with a proposed table design and some assumptions about the workflow. I'd like to get some feedback from @akio-sone, @landreev, @pameyer, and anyone else who would like to weigh in:

Support Multiple Storage Locations Table Design:

A single file can be stored in multiple locations. In order for Dataverse to keep track of the locations where files are stored, we will need two new tables: the first will represent a storage location, and the second will link a datafile to one or more storage locations.

Storage Location table:

  • Id (Long): identifier for the storage location
  • Name (String): human-recognizable name for the location, to be displayed by Dataverse (on the File page or Dataset page?)
  • Description
  • URL (String): storage location URL for depositing/retrieving files
  • Type
  • StorageDriver
  • TransferProtocols (String): protocols for depositing/retrieving files
  • PrimaryWriteLocation
  • Additional fields needed?

Join table:

  • StorageLocation_Id
  • DvObject_id (datafile level)
  • StorageLocationAddress: info on where to find the file within the storage location
  • PrimaryLocation (Boolean): true if this is where Dataverse should serve a download from by default
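
To make the proposal concrete, here is a minimal DDL sketch of the two tables, assuming PostgreSQL; the names and types are illustrative only, not the final schema:

```sql
-- Hypothetical sketch of the two proposed tables; names/types illustrative.
CREATE TABLE storagelocation (
    id                   BIGSERIAL PRIMARY KEY,
    name                 TEXT NOT NULL,   -- human-recognizable label
    description          TEXT,
    url                  TEXT NOT NULL,   -- base URL for deposit/retrieval
    type                 TEXT,
    storagedriver        TEXT,
    transferprotocols    TEXT,            -- e.g. 'rsync,posix'
    primarywritelocation BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE datafile_storagelocation (
    storagelocation_id     BIGINT NOT NULL REFERENCES storagelocation (id),
    dvobject_id            BIGINT NOT NULL REFERENCES dvobject (id),
    storagelocationaddress TEXT,     -- where the file lives within the site
    primarylocation        BOOLEAN NOT NULL DEFAULT FALSE,
    PRIMARY KEY (storagelocation_id, dvobject_id)
);
```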

Workflow Assumptions:

  • For now, Storage Location records will be created via the API.
  • Join table records for the primary (default) location will be written by Dataverse. Additional records for outside locations will be written via the API.
  • Current file storage will be represented by one record in the storage location table, plus one record in the join table for each file (migration of records will happen at release time); a rough migration sketch follows.
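
A rough sketch of what that release-time migration could look like, continuing the hypothetical schema above; the literal values and the dtype filter are assumptions, not tested code:

```sql
-- Hypothetical migration: register today's single local storage as a
-- location, then point every existing datafile at it, carrying over the
-- existing storage identifier as the in-site address.
INSERT INTO storagelocation (name, url, type, primarywritelocation)
VALUES ('local', 'file:///usr/local/dvn/data', 'posix', TRUE);

INSERT INTO datafile_storagelocation
    (storagelocation_id, dvobject_id, storagelocationaddress, primarylocation)
SELECT sl.id, o.id, o.storageidentifier, TRUE
FROM dvobject o, storagelocation sl
WHERE o.dtype = 'DataFile' AND sl.name = 'local';
```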

@pameyer pameyer removed their assignment Apr 16, 2018
@pdurbin pdurbin self-assigned this Apr 17, 2018
pdurbin added a commit that referenced this issue Apr 17, 2018
- rename table and add primaryStorage and transferProtocols
- rules around primaryStorage
- various cleanup
@pdurbin (Member) commented Apr 17, 2018

@pameyer I attempted to implement all the changes we talked about (plus some clean up) in 47dde01. Please take a look at pull request #4585 and if you feel like it's ready for QA, I'd say you can go ahead and move it over. If there's anything else you need, please kick it back to me.

@pdurbin pdurbin removed their assignment Apr 17, 2018
@pameyer pameyer self-assigned this Apr 18, 2018
pdurbin added a commit that referenced this issue Apr 18, 2018
@pameyer pameyer removed their assignment Apr 18, 2018
@pdurbin (Member) commented Apr 18, 2018

@pameyer I just made various improvements and fixes we talked about this morning. Can you please take another look? Thanks!

@pameyer (Contributor, Author) commented Apr 19, 2018

@pdurbin Looks good to me (and as a bonus, less XHTML on errors). Thanks!

@pameyer pameyer removed their assignment Apr 19, 2018
pdurbin added a commit that referenced this issue Apr 19, 2018
…ons #4396

Conflicts (new methods added in same place on both branches):
src/test/java/edu/harvard/iq/dataverse/api/UtilIT.java
@pdurbin (Member) commented Apr 19, 2018

@pameyer sure, I just merged the latest from "develop" in the pull request in e776b85. Thanks for moving this to QA.

@kcondon in terms of how to test, the changes to the guides are a good starting point. Please note that the :ReplicationSites database setting has gone away (it was added in #3998, and there are screenshots in that issue of rsync instructions). We are now using a proper database table called storagesite to store these entries, populating it with our usual pattern of sending JSON to an API endpoint with curl; a sketch of that is below. We've implemented some rules, especially around the primaryLocation boolean: there can only be one primary location. The storage sites are ordered by id in the UI. I'm happy to write more in the guides if you want me to.
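
As an illustration of that pattern (the hostname and field values below are made-up examples, and the endpoint and field names are my reading of the guide changes in pull request #4585, so double-check against the guides):

```bash
cat > storagesite.json <<'EOF'
{
  "hostname": "dv.example.edu",
  "name": "Example Site",
  "primaryStorage": true,
  "transferProtocols": "rsync"
}
EOF

# Add a storage site (only one site may have primaryStorage=true):
curl -X POST -H "Content-type: application/json" \
  http://localhost:8080/api/admin/storageSites --upload-file storagesite.json

# List the configured storage sites:
curl http://localhost:8080/api/admin/storageSites
```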

@kcondon (Contributor) commented Apr 20, 2018

Was not able to complete testing due to #4605.
