Data Capture Module (rsync support) #3145
Comments
For the DCM (and data uploads generally), "rsync" is shorthand for rsync over ssh. Client-side checksums are also part of the DCM: we'll have to decide how we want to handle the difference in hash functions (switching hashes or multi-hash support in Dataverse). Sorting that out might be out of scope for the DCM MVP.
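To make the "rsync over ssh" shorthand concrete, here is a minimal Python sketch that only builds the command line an uploader might run; it executes nothing. The function name, user, host, and paths are hypothetical placeholders and not anything the DCM defines. `-a` (archive mode) and `-e ssh` (use ssh as the remote shell) are standard rsync options.

```python
import shlex


def build_rsync_over_ssh_command(local_dir, remote_user, remote_host, remote_dir):
    """Return the argv list for an rsync-over-ssh upload (nothing is executed)."""
    # A trailing slash on the source tells rsync to copy the directory's contents.
    source = local_dir.rstrip("/") + "/"
    destination = f"{remote_user}@{remote_host}:{remote_dir}"
    # -a: archive mode; -e ssh: run the transfer over ssh.
    return ["rsync", "-a", "-e", "ssh", source, destination]


# Hypothetical values, for illustration only.
command = build_rsync_over_ssh_command(
    "/data/upload", "uploader", "dcm.example.edu", "/deposit/dataset1"
)
print(shlex.join(command))
```

Returning a list rather than a single string keeps the sketch safe to hand to `subprocess.run` without shell quoting issues, if someone wanted to try it against a real host.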
@bmckinney and I met yesterday (notes at https://docs.google.com/document/d/1BSVqAqsc_KieqfFfg_CeKdV7HwDO1Y3UuC2VMO-RJDk/edit?usp=sharing ). I demo'ed https://github.com/pdurbin/dataverse/tree/3145-dcm and he's going to try to merge that branch with https://github.com/bmckinney/bio-dataverse/tree/feature/file-system-import so we can deploy the combined code at https://dv.sbgrid.org and hopefully get closer to a prototype of rsync support. I expect we'll need help from @pameyer to switch from my mock version of the Data Capture Module at https://github.com/sbgrid/data-capture-module/blob/master/api/dcm.py to more of the real thing. All code mentioned above is very preliminary at this point. We still need to meet with @landreev to discuss how to make rsync support compatible with file versioning.
(Note to self, mostly) This is the parent issue for the items created and estimated in the 9/8 meeting; notes are recorded here: https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit These will be created as new GitHub issues and linked here.
@djbrooke thanks! Here are the related issues we created today:
A dependency for rsync support (#3145) is the ability to persist SHA-1 checksums for files rather than MD5 checksums. A new installation-wide configuration setting called ":FileFixityChecksumAlgorithm" has been added, which can be set to "SHA-1" to have Dataverse calculate and show SHA-1 checksums rather than MD5 checksums. To run this branch you must run the provided SQL upgrade script: scripts/database/upgrades/3354-alt-checksum.sql. In addition, the Solr schema should be updated to the version in this branch.
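As a rough illustration of what switching the fixity algorithm means for a single file, here is a small Python sketch that computes either checksum depending on a setting value. This is not Dataverse's implementation: the function name, the `ALGORITHMS` mapping, and the "MD5" default are assumptions made for the example; only the "MD5" and "SHA-1" values come from the description above.

```python
import hashlib

# Assumed mapping from the setting's value to hashlib algorithm names.
ALGORITHMS = {"MD5": "md5", "SHA-1": "sha1"}


def fixity_checksum(path, setting_value="MD5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path` for the chosen algorithm."""
    if setting_value not in ALGORITHMS:
        raise ValueError(f"unsupported checksum algorithm: {setting_value}")
    digest = hashlib.new(ALGORITHMS[setting_value])
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A client that uploads via rsync could compare its own SHA-1 value against the one computed this way on the server side, which is the kind of agreement the hash-function discussion in this issue is about.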
#3249 is highly related in that, ultimately, end users will need to know how to download the data via rsync or whatever mechanism is used. The focus of this issue to date has been researchers uploading data, not end users downloading it.
These days we're working in small chunks. To follow along, start at the next small chunk that's currently in the backlog: #3942. Closing.
@pameyer, @bmckinney, and I met yesterday to discuss what we're calling a "Data Capture Module" or "DCM" for short. http://guides.dataverse.org/en/4.3.1/installation/prep.html#architecture-and-components lists a number of optional components for Dataverse (Shibboleth, rApache, Rserve, Geoconnect, etc.) and "Data Capture Module" will be added to the list. The DCM's main role in the architecture is facilitating large file transfer (#952), especially via non-HTTP mechanisms such as rsync.
The Minimum Viable Product (MVP) for the Data Capture Module includes support for rsync (#2960), but other mechanisms are under consideration, such as Globus (#2728, #952), Aspera, and SFTP. https://data.sbgrid.org already supports rsync and we expect to reuse code from that service, cleaning it up and generalizing it.
The task list for the Data Capture Module is still very much in flux but I'm creating this issue so that I have an issue number to associate a branch with as I start committing some code on the Dataverse side, especially API endpoints and the ability for Dataverse to talk to the DCM.