BagIt Support - Add automatic checksum validation on upload
abujeda committed May 18, 2022
1 parent e94a78a commit 61b073a
Showing 45 changed files with 2,829 additions and 21 deletions.
10 changes: 10 additions & 0 deletions doc/release-notes/8608-bagit-support-validate-checksums.md
@@ -0,0 +1,10 @@
## BagIt Support - Automatic checksum validation on zip file upload
The BagIt file handler detects zip files in the BagIt package format and transforms them into Dataverse DataFiles. The system validates the checksums of the files in the package payload against the first manifest file that uses a supported hash algorithm. See the [BagChecksumType class](https://github.com/IQSS/dataverse/tree/develop/src/main/java/edu/harvard/iq/dataverse/util/bagit/BagChecksumType.java) for the list of currently supported hash algorithms.

The handler rejects packages with checksum errors. By default, the first 5 errors are displayed to the user; this number is configurable through database settings.

The checksum validation uses a thread pool to improve performance. The thread pool can be tuned to the requirements of your Dataverse installation.

The BagIt file handler is disabled by default. Use the ``:BagItHandlerEnabled`` database setting to enable it: ``curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:BagItHandlerEnabled``

For more configuration settings, see the Installation Guide: https://guides.dataverse.org/en/latest/installation/config.html#bagit-file-handler
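
The thread-pool validation described in these notes can be pictured with the sketch below. It is only an illustration under stated assumptions, not the Dataverse implementation: `ChecksumValidationSketch`, its method names, the in-memory maps standing in for the manifest and payload, and the choice of SHA-256 are all hypothetical. The `poolSize`, `maxErrors`, and `waitIntervalMillis` parameters loosely mirror the pool size, error limit, and wait interval settings.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the Dataverse bag validator.
final class ChecksumValidationSketch {

    // Returns the lowercase hex SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every JVM
        }
    }

    // manifest maps a payload path to its expected checksum;
    // payload maps a path to the file content (assumed to fit in memory).
    static List<String> validate(Map<String, String> manifest, Map<String, byte[]> payload,
                                 int poolSize, int maxErrors, long waitIntervalMillis) {
        ConcurrentLinkedQueue<String> errors = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (Map.Entry<String, String> entry : manifest.entrySet()) {
            pool.submit(() -> {
                if (errors.size() >= maxErrors) {
                    return; // limit reached; skip the remaining files
                }
                byte[] content = payload.get(entry.getKey());
                if (content == null) {
                    errors.add("File not found: " + entry.getKey());
                } else if (!sha256Hex(content).equalsIgnoreCase(entry.getValue())) {
                    errors.add("Invalid checksum: " + entry.getKey());
                }
            });
        }
        pool.shutdown(); // no new tasks; queued tasks still run
        try {
            // Wake up every wait interval and abort once too many errors are found.
            while (!pool.awaitTermination(waitIntervalMillis, TimeUnit.MILLISECONDS)) {
                if (errors.size() >= maxErrors) {
                    pool.shutdownNow();
                }
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(errors);
    }
}
```

Because the error count is checked without locking, a few more than `maxErrors` errors may be collected before the pool stops; a caller that needs an exact cap would truncate the returned list.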
59 changes: 59 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -1038,6 +1038,22 @@ Disabling Custom Dataset Terms

See :ref:`:AllowCustomTermsOfUse` for how to disable the "Custom Dataset Terms" option.

.. _BagIt File Handler:

BagIt File Handler
------------------

The BagIt file handler detects zip files in the BagIt package format and transforms them into Dataverse DataFiles. The system validates the checksums of the files in the package payload against the first manifest file that uses a supported hash algorithm. See the `BagChecksumType class <https://github.com/IQSS/dataverse/tree/develop/src/main/java/edu/harvard/iq/dataverse/util/bagit/BagChecksumType.java>`_ for the list of currently supported hash algorithms.

The checksum validation uses a thread pool to improve performance. The thread pool can be tuned to the requirements of your Dataverse installation.

BagIt file handler configuration settings:

- :ref:`:BagItHandlerEnabled`
- :ref:`:BagValidatorJobPoolSize`
- :ref:`:BagValidatorMaxErrors`
- :ref:`:BagValidatorJobWaitInterval`

.. _BagIt Export:

BagIt Export
@@ -2536,6 +2552,49 @@ To enable redirects to the zipper on a different server:

``curl -X PUT -d 'https://zipper.example.edu/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl``

:CreateDataFilesMaxErrorsToDisplay
++++++++++++++++++++++++++++++++++

The maximum number of errors to display to the user when creating DataFiles from a file upload. Defaults to 5.

``curl -X PUT -d '1' http://localhost:8080/api/admin/settings/:CreateDataFilesMaxErrorsToDisplay``
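
How a missing or malformed value falls back to the default can be sketched as follows. This is a simplified, standalone illustration modeled on the `getInteger` helper added to `SettingsWrapper` in this commit; the `IntegerSettingSketch` class and the plain `Map` standing in for the settings store are hypothetical.

```java
import java.util.Map;

// Hypothetical standalone sketch; a Map stands in for the settings store.
final class IntegerSettingSketch {

    // Returns the setting parsed as an Integer, or defaultValue when the
    // setting is absent or is not a valid integer.
    static Integer getInteger(Map<String, String> settings, String key, Integer defaultValue) {
        String value = settings.get(key);
        if (value != null) {
            try {
                return Integer.valueOf(value);
            } catch (NumberFormatException e) {
                // Invalid value: fall through to the default.
            }
        }
        return defaultValue;
    }
}
```

With this behavior, a typo in the stored value does not break uploads; the default of 5 is used instead.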

.. _:BagItHandlerEnabled:

:BagItHandlerEnabled
+++++++++++++++++++++

Part of the database settings to configure the BagIt file handler. Enables the BagIt file handler. By default, the handler is disabled.

``curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:BagItHandlerEnabled``

.. _:BagValidatorJobPoolSize:

:BagValidatorJobPoolSize
++++++++++++++++++++++++

Part of the database settings to configure the BagIt file handler. The number of threads the checksum validation class uses to validate a single zip file. Defaults to 4 threads.

``curl -X PUT -d '10' http://localhost:8080/api/admin/settings/:BagValidatorJobPoolSize``

.. _:BagValidatorMaxErrors:

:BagValidatorMaxErrors
++++++++++++++++++++++

Part of the database settings to configure the BagIt file handler. The maximum number of errors allowed before the validation job aborts execution. This is to avoid processing the whole BagIt package. Defaults to 5 errors.

``curl -X PUT -d '2' http://localhost:8080/api/admin/settings/:BagValidatorMaxErrors``

.. _:BagValidatorJobWaitInterval:

:BagValidatorJobWaitInterval
++++++++++++++++++++++++++++

Part of the database settings to configure the BagIt file handler. The interval, in seconds, at which the validation job checks the number of errors found so far. Defaults to 10.

``curl -X PUT -d '60' http://localhost:8080/api/admin/settings/:BagValidatorJobWaitInterval``

:ArchiverClassName
++++++++++++++++++

@@ -0,0 +1,40 @@
package edu.harvard.iq.dataverse;

import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.util.file.CreateDataFileResult;

import javax.ejb.Stateless;
import javax.inject.Inject;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

/**
 *
 * @author adaybujeda
 */
@Stateless
public class EditDataFilesPageHelper {

    public static final String MAX_ERRORS_TO_DISPLAY_SETTING = ":CreateDataFilesMaxErrorsToDisplay";
    public static final Integer MAX_ERRORS_TO_DISPLAY = 5;

    @Inject
    private SettingsWrapper settingsWrapper;

    public String getHtmlErrorMessage(CreateDataFileResult createDataFileResult) {
        List<String> errors = createDataFileResult.getErrors();
        if(errors == null || errors.isEmpty()) {
            return null;
        }

        Integer maxErrorsToShow = settingsWrapper.getInteger(EditDataFilesPageHelper.MAX_ERRORS_TO_DISPLAY_SETTING, EditDataFilesPageHelper.MAX_ERRORS_TO_DISPLAY);
        if(maxErrorsToShow < 1) {
            return null;
        }

        String typeMessage = Optional.ofNullable(BundleUtil.getStringFromBundle(createDataFileResult.getBundleKey())).orElse("Error processing file");
        String errorsMessage = errors.stream().limit(maxErrorsToShow).map(text -> String.format("<li>%s</li>", text)).collect(Collectors.joining());
        return String.format("%s:<br /><ul>%s</ul>", typeMessage, errorsMessage);
    }
}
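
To see what the helper above produces, the following standalone sketch reproduces only its truncation and formatting steps. `ErrorMessageSketch` and `toHtml` are hypothetical names, and the bundle lookup and settings injection are replaced by plain parameters, so this illustrates the output shape rather than the class itself.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical standalone sketch of the message formatting.
final class ErrorMessageSketch {

    // Builds an HTML list from at most maxErrorsToShow errors, or returns
    // null when there is nothing to show.
    static String toHtml(String typeMessage, List<String> errors, int maxErrorsToShow) {
        if (errors == null || errors.isEmpty() || maxErrorsToShow < 1) {
            return null;
        }
        String items = errors.stream()
                .limit(maxErrorsToShow)
                .map(text -> String.format("<li>%s</li>", text))
                .collect(Collectors.joining());
        return String.format("%s:<br /><ul>%s</ul>", typeMessage, items);
    }
}
```

For example, `toHtml("Error processing file", List.of("a", "b", "c"), 2)` returns `Error processing file:<br /><ul><li>a</li><li>b</li></ul>`; the third error is dropped by the limit.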
26 changes: 21 additions & 5 deletions src/main/java/edu/harvard/iq/dataverse/EditDatafilesPage.java
@@ -60,6 +60,8 @@
import javax.faces.view.ViewScoped;
import javax.inject.Inject;
import javax.inject.Named;

import edu.harvard.iq.dataverse.util.file.CreateDataFileResult;
import org.primefaces.event.FileUploadEvent;
import org.primefaces.model.file.UploadedFile;
import javax.json.Json;
@@ -143,6 +145,8 @@ public enum Referrer {
LicenseServiceBean licenseServiceBean;
@Inject
DataFileCategoryServiceBean dataFileCategoryService;
@Inject
EditDataFilesPageHelper editDataFilesPageHelper;

private Dataset dataset = new Dataset();

@@ -1485,7 +1489,9 @@ public void handleDropBoxUpload(ActionEvent event) {
// for example, multiple files can be extracted from an uncompressed
// zip file.
//datafiles = ingestService.createDataFiles(workingVersion, dropBoxStream, fileName, "application/octet-stream");
datafiles = FileUtil.createDataFiles(workingVersion, dropBoxStream, fileName, "application/octet-stream", null, null, systemConfig);
CreateDataFileResult createDataFilesResult = FileUtil.createDataFiles(workingVersion, dropBoxStream, fileName, "application/octet-stream", null, null, systemConfig);
datafiles = createDataFilesResult.getDataFiles();
errorMessage = editDataFilesPageHelper.getHtmlErrorMessage(createDataFilesResult);

} catch (IOException ex) {
this.logger.log(Level.SEVERE, "Error during ingest of DropBox file {0} from link {1}", new Object[]{fileName, fileLink});
@@ -1739,6 +1745,10 @@ public void uploadFinished() {
uploadedFiles.clear();
uploadInProgress.setValue(false);
}
if(errorMessage != null) {
FacesContext.getCurrentInstance().addMessage(null, new FacesMessage(FacesMessage.SEVERITY_ERROR, BundleUtil.getStringFromBundle("dataset.file.uploadFailure"), errorMessage));
PrimeFaces.current().ajax().update(":messagePanel");
}
// refresh the warning message below the upload component, if exists:
if (uploadComponentId != null) {
if (uploadWarningMessage != null) {
@@ -1787,6 +1797,7 @@ public void uploadFinished() {
multipleDupesNew = false;
uploadWarningMessage = null;
uploadSuccessMessage = null;
errorMessage = null;
}

private String warningMessageForFileTypeDifferentPopUp;
@@ -1937,6 +1948,7 @@ private void handleReplaceFileUpload(String fullStorageLocation,
}

private String uploadWarningMessage = null;
private String errorMessage = null;
private String uploadSuccessMessage = null;
private String uploadComponentId = null;

@@ -2005,8 +2017,10 @@ public void handleFileUpload(FileUploadEvent event) throws IOException {
try {
// Note: A single uploaded file may produce multiple datafiles -
// for example, multiple files can be extracted from an uncompressed
// zip file.
dFileList = FileUtil.createDataFiles(workingVersion, uFile.getInputStream(), uFile.getFileName(), uFile.getContentType(), null, null, systemConfig);
// zip file.
CreateDataFileResult createDataFilesResult = FileUtil.createDataFiles(workingVersion, uFile.getInputStream(), uFile.getFileName(), uFile.getContentType(), null, null, systemConfig);
dFileList = createDataFilesResult.getDataFiles();
errorMessage = editDataFilesPageHelper.getHtmlErrorMessage(createDataFilesResult);

} catch (IOException ioex) {
logger.warning("Failed to process and/or save the file " + uFile.getFileName() + "; " + ioex.getMessage());
@@ -2111,7 +2125,9 @@ public void handleExternalUpload() {
// for example, multiple files can be extracted from an uncompressed
// zip file.
//datafiles = ingestService.createDataFiles(workingVersion, dropBoxStream, fileName, "application/octet-stream");
datafiles = FileUtil.createDataFiles(workingVersion, null, fileName, contentType, fullStorageIdentifier, checksumValue, checksumType, systemConfig);
CreateDataFileResult createDataFilesResult = FileUtil.createDataFiles(workingVersion, null, fileName, contentType, fullStorageIdentifier, checksumValue, checksumType, systemConfig);
datafiles = createDataFilesResult.getDataFiles();
errorMessage = editDataFilesPageHelper.getHtmlErrorMessage(createDataFilesResult);
} catch (IOException ex) {
logger.log(Level.SEVERE, "Error during ingest of file {0}", new Object[]{fileName});
}
@@ -3066,5 +3082,5 @@ public boolean isFileAccessRequest() {

public void setFileAccessRequest(boolean fileAccessRequest) {
this.fileAccessRequest = fileAccessRequest;
}
}
}
13 changes: 13 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/SettingsWrapper.java
@@ -177,6 +177,19 @@ public boolean isTrueForKey(String settingKey, boolean safeDefaultIfKeyNotFound)
return ( val==null ) ? safeDefaultIfKeyNotFound : StringUtil.isTrue(val);
}

public Integer getInteger(String settingKey, Integer defaultValue) {
String settingValue = get(settingKey);
if(settingValue != null) {
try {
return Integer.valueOf(settingValue);
} catch (Exception e) {
logger.warning(String.format("action=getInteger result=invalid-integer settingKey=%s settingValue=%s", settingKey, settingValue));
}
}

return defaultValue;
}

private void initSettingsMap() {
// initialize settings map
settingsMap = new HashMap<>();
@@ -35,6 +35,8 @@
import javax.servlet.http.HttpServletRequest;
import javax.validation.ConstraintViolation;
import javax.validation.ConstraintViolationException;

import edu.harvard.iq.dataverse.util.file.CreateDataFileResult;
import org.swordapp.server.AuthCredentials;
import org.swordapp.server.Deposit;
import org.swordapp.server.DepositReceipt;
@@ -301,7 +303,8 @@ DepositReceipt replaceOrAddFiles(String uri, Deposit deposit, AuthCredentials au
List<DataFile> dataFiles = new ArrayList<>();
try {
try {
dataFiles = FileUtil.createDataFiles(editVersion, deposit.getInputStream(), uploadedZipFilename, guessContentTypeForMe, null, null, systemConfig);
CreateDataFileResult createDataFilesResponse = FileUtil.createDataFiles(editVersion, deposit.getInputStream(), uploadedZipFilename, guessContentTypeForMe, null, null, systemConfig);
dataFiles = createDataFilesResponse.getDataFiles();
} catch (EJBException ex) {
Throwable cause = ex.getCause();
if (cause != null) {
@@ -32,6 +32,7 @@
import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.util.file.CreateDataFileResult;
import edu.harvard.iq.dataverse.util.json.JsonPrinter;
import java.io.IOException;
import java.io.InputStream;
@@ -1206,14 +1207,15 @@ private boolean step_030_createNewFilesViaIngest(){
workingVersion = dataset.getEditVersion();
clone = workingVersion.cloneDatasetVersion();
try {
initialFileList = FileUtil.createDataFiles(workingVersion,
CreateDataFileResult result = FileUtil.createDataFiles(workingVersion,
this.newFileInputStream,
this.newFileName,
this.newFileContentType,
this.newStorageIdentifier,
this.newCheckSum,
this.newCheckSumType,
this.systemConfig);
initialFileList = result.getDataFiles();

} catch (IOException ex) {
if (!Strings.isNullOrEmpty(ex.getMessage())) {