BagIt Support - Add automatic checksum validation on upload #8677
Moving along.
The PR changes a large number of class files, but the changes are pretty straightforward.
(I may add a couple of words under "how to test").
@adaybujeda Would you refresh this branch from develop? We recently renamed a flyway script and this pr has the old name and so cannot deploy. Thanks!
Rebase from Thanks!
@adaybujeda Apologies, but do you have a test bag I can use? I tried one from Jim but it fails when I upload in UI.
Hi @kcondon, we created a couple of BagIt packages to do internal testing.
@adaybujeda Thanks for the sample files, they worked fine. I did have a question on the validation. It seems the failure examples complain about files not existing that are listed in the manifest rather than having bad checksums? I ask because when I edit the working bag file manifest-sha512.txt (sha512 is what my dataverse installation is using) and alter the checksums, it doesn't fail. What is it checking then? There is also a tagmanifest-sha512.txt
Issues found/questions:
Thanks @kcondon.
For file validation, the backend will search the manifests provided in the zip and use the first one that the code can process. The checksums supported are controlled by
The :BagValidatorMaxErrors setting is applied on a best-effort basis during file validation. Checksums are computed on a thread pool to improve performance for large files. While waiting for completion, the code checks every 10 seconds whether :BagValidatorMaxErrors has been reached; if so, it stops processing and returns. For small enough files, processing completes within the first 10 seconds, so all files are processed. For the FE, we use :CreateDataFilesMaxErrorsToDisplay to control how many of these processing errors we want to show.
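To make the best-effort behaviour concrete, here is a minimal Python sketch of the pattern described above: checksums run on a thread pool, and the caller polls at a fixed interval to see whether the error limit has been reached. It is an illustration only, not the PR's Java code; the function name validate_checksums, its parameters, the error message wording, and the use of SHA-256 are all assumptions.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def validate_checksums(files, expected, max_errors=5, poll_seconds=10.0, workers=4):
    """Best-effort validation (hypothetical helper, not the Dataverse API).

    files:    mapping of path -> bytes actually present in the bag
    expected: mapping of path -> hex SHA-256 digest taken from the manifest
    Returns the list of error messages collected before stopping.
    """
    errors = []
    lock = threading.Lock()

    def check(path):
        data = files.get(path)
        if data is None:
            message = f"missing file: {path}"
        elif hashlib.sha256(data).hexdigest() != expected[path]:
            message = f"checksum mismatch: {path}"
        else:
            return
        with lock:
            errors.append(message)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(check, path) for path in expected}
        while pending:
            # Returns when everything is done or the poll interval elapses.
            _, pending = wait(pending, timeout=poll_seconds)
            if len(errors) >= max_errors:
                # Stop early: cancel work that has not started yet.
                for future in pending:
                    future.cancel()
                break
    return errors
```

Because the limit is only checked once per poll interval, a run over small files finishes before the first check and reports every error, while a long run may collect somewhat more than max_errors before it stops. That is why a separate display limit is still useful on the front end.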
What this PR does / why we need it:
It adds a new file handler to manage BagIt packages that are uploaded using a Zip file.
The first requirement is to detect that the uploaded zip file is a BagIt package, extract the files as they are, and perform the checksum validation.
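As a rough illustration of the detection step, the following Python sketch treats a zip as a BagIt package when it contains a bagit.txt entry and, in the same folder, a manifest file for a supported hash algorithm. This is not the PR's Java code: the function name find_bag_root and the SUPPORTED tuple are assumptions made for the example, and the set of algorithms a real installation accepts may differ.

```python
import io
import posixpath
import re
import zipfile

# Assumed list of algorithms, for illustration only.
SUPPORTED = ("md5", "sha1", "sha256", "sha512")
MANIFEST_RE = re.compile(r"^manifest-([a-z0-9]+)\.txt$")


def find_bag_root(zip_bytes):
    """Return the folder holding bagit.txt and a supported manifest, else None.

    An empty string means the bag sits at the top level of the zip.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()

    for name in names:
        if posixpath.basename(name) != "bagit.txt":
            continue
        root = posixpath.dirname(name)
        # Look for a manifest next to bagit.txt, not anywhere in the zip.
        for other in names:
            if posixpath.dirname(other) != root:
                continue
            match = MANIFEST_RE.match(posixpath.basename(other))
            if match and match.group(1) in SUPPORTED:
                return root
    return None
```

A zip with bagit.txt but no recognised manifest returns None here, so in this sketch it would be handled as an ordinary zip upload.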
Which issue(s) this PR closes:
Special notes for your reviewer:
BagIt package detection: When uploading a zip file, the system will look for a zip entry called bagit.txt. Then within the same folder where that file is, it will look for a manifest file with a supported hash algorithm, like manifest-sha256.txt. If both are found, the zip file is deemed a BagIt package.
Suggestions on how to test this:
Enable the feature:
curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:BagItHandlerEnabled
Upload a BagIt package as a Zip file. It should extract all files and perform the checksum validation.
Upload a BagIt package with invalid checksums. The upload should not be allowed and up to 5 errors should be highlighted in the UI.
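For the two upload scenarios above, a small test bag can be generated rather than edited by hand. The Python sketch below builds a zip with bagit.txt, a manifest-sha256.txt, and payload files under data/, and can optionally write wrong digests to produce the invalid case. The helper name build_bag_zip, the sample-bag root folder, and the choice of SHA-256 are assumptions for illustration; these are not the sample files shared in the conversation.

```python
import hashlib
import io
import zipfile


def build_bag_zip(payload, corrupt=False, root="sample-bag"):
    """Build a minimal BagIt zip in memory (hypothetical test helper).

    payload: mapping of file name -> bytes, stored under <root>/data/
    corrupt: when True, write all-zero digests so validation should fail
    """
    manifest_lines = []
    for name, data in sorted(payload.items()):
        digest = hashlib.sha256(data).hexdigest()
        if corrupt:
            digest = "0" * len(digest)
        # Manifest line format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            f"{root}/bagit.txt",
            "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        )
        zf.writestr(f"{root}/manifest-sha256.txt", "\n".join(manifest_lines) + "\n")
        for name, data in payload.items():
            zf.writestr(f"{root}/data/{name}", data)
    return buf.getvalue()


if __name__ == "__main__":
    files = {"a.txt": b"hello", "b.txt": b"world"}
    with open("valid-bag.zip", "wb") as out:
        out.write(build_bag_zip(files))
    with open("invalid-bag.zip", "wb") as out:
        out.write(build_bag_zip(files, corrupt=True))
```

Since the backend uses the first manifest it can process, a bag that ships several manifests may need the corrupted digests in whichever manifest the installation actually reads.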
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
It adds a validation error message to the upload screen. Sample of the screenshot in the issue: #8608
Is there a release notes update needed for this change?:
It will be included in the changes
Additional documentation:
BagIt documentation will be added to the Dataverse guide.
This is part of the Harvard Data Commons project.