
GDCC/9005 replace files api call #9018

Merged
3 changes: 3 additions & 0 deletions doc/release-notes/9005-replaceFiles-api-call
@@ -0,0 +1,3 @@
9005

Direct upload and out-of-band uploads can now be used to replace multiple files with one API call (complementing the prior ability to add multiple new files)
49 changes: 7 additions & 42 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -1511,6 +1511,13 @@ The fully expanded example above (without environment variables) looks like this

curl -H X-Dataverse-key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx -X POST https://demo.dataverse.org/api/datasets/:persistentId/add?persistentId=doi:10.5072/FK2/J8SJZB -F 'jsonData={"description":"A remote image.","storageIdentifier":"trsa://themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png","checksumType":"MD5","md5Hash":"509ef88afa907eaf2c17c1c8d8fde77e","label":"testlogo.png","fileName":"testlogo.png","mimeType":"image/png"}'

Adding Files To a Dataset via Other Tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some circumstances, it may be useful to move or copy files into Dataverse's storage manually or via external tools and then add them to a dataset (i.e. without involving Dataverse in the file transfer itself).
Two API calls are available for this use case: one to add files to a dataset and one to replace files that are already in the dataset.
These calls were developed as part of Dataverse's direct upload mechanism and are detailed in :doc:`/developers/s3-direct-upload-api`.
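
As a quick orientation, here is a minimal sketch of the two calls, using placeholder ``storageIdentifier``, checksum, and ``fileToReplaceId`` values; see :doc:`/developers/s3-direct-upload-api` for the full ``jsonData`` options:

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_IDENTIFIER=doi:10.5072/FK2/7U7YBV

# Add files whose bytes are already in the dataset's store
curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/addFiles?persistentId=$PERSISTENT_IDENTIFIER" -F jsonData='[{"storageIdentifier":"s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42", "fileName":"file1.txt", "mimeType":"text/plain", "checksum": {"@type": "SHA-1", "@value": "123456"}}]'

# Replace an existing file (database id 10) with one whose bytes are already in the store
curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/replaceFiles?persistentId=$PERSISTENT_IDENTIFIER" -F jsonData='[{"fileToReplaceId": 10, "storageIdentifier":"s3://demo-dataverse-bucket:176e28068b0-1c3f80357d53", "fileName":"file2.txt", "mimeType":"text/plain", "checksum": {"@type": "SHA-1", "@value": "123456"}}]'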

Report the data (file) size of a Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -2366,48 +2373,6 @@ The fully expanded example above (without environment variables) looks like this
Note: The ``id`` returned in the json response is the id of the file metadata version.



Adding File Metadata
Contributor

Not sure why you're removing this doc since the endpoint still exists and it seems like we would want to have the ability to add files' metadata, not just replace

Member Author

This endpoint doesn't work as ~implied by this section and this placement in the guides. You can't change the metadata of existing files with it. It is a way to add new files that have been uploaded by direct S3, out-of-band means, or when using the remoteOverlayStore. This call is documented already at https://guides.dataverse.org/en/latest/developers/s3-direct-upload-api.html?highlight=addfiles#to-add-multiple-uploaded-files-to-the-dataset.

Contributor

OK. thanks for the explanation

~~~~~~~~~~~~~~~~~~~~

This API call requires a ``jsonString`` expressing the metadata of multiple files. It adds file metadata to the database for files that have already been copied to storage.

The jsonData object includes values for:

* "description" - A description of the file
* "directoryLabel" - The "File Path" of the file, indicating which folder the file should be uploaded to within the dataset
* "storageIdentifier" - String
* "fileName" - String
* "mimeType" - String
* "fixity/checksum" either:

* "md5Hash" - String with MD5 hash value, or
* "checksum" - Json Object with "@type" field specifying the algorithm used and "@value" field with the value from that algorithm, both Strings

.. note:: See :ref:`curl-examples-and-environment-variables` if you are unfamiliar with the use of ``export`` below.

A curl example using a ``PERSISTENT_ID``

* ``SERVER_URL`` - e.g. https://demo.dataverse.org
* ``API_TOKEN`` - API endpoints require an API token that can be passed as the X-Dataverse-key HTTP header. For more details, see the :doc:`auth` section.
* ``PERSISTENT_IDENTIFIER`` - Example: ``doi:10.5072/FK2/7U7YBV``

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_IDENTIFIER=doi:10.5072/FK2/7U7YBV
export JSON_DATA="[{'description':'My description.','directoryLabel':'data/subdir1','categories':['Data'], 'restrict':'false', 'storageIdentifier':'s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42', 'fileName':'file1.txt', 'mimeType':'text/plain', 'checksum': {'@type': 'SHA-1', '@value': '123456'}}, \
{'description':'My description.','directoryLabel':'data/subdir1','categories':['Data'], 'restrict':'false', 'storageIdentifier':'s3://demo-dataverse-bucket:176e28068b0-1c3f80357d53', 'fileName':'file2.txt', 'mimeType':'text/plain', 'checksum': {'@type': 'SHA-1', '@value': '123789'}}]"

curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/addFiles?persistentId=$PERSISTENT_IDENTIFIER" -F "jsonData=$JSON_DATA"

The fully expanded example above (without environment variables) looks like this:

.. code-block:: bash

curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST https://demo.dataverse.org/api/datasets/:persistentId/addFiles?persistentId=doi:10.5072/FK2/7U7YBV -F jsonData='[{"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"false", "storageIdentifier":"s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42", "fileName":"file1.txt", "mimeType":"text/plain", "checksum": {"@type": "SHA-1", "@value": "123456"}}, {"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"], "restrict":"false", "storageIdentifier":"s3://demo-dataverse-bucket:176e28068b0-1c3f80357d53", "fileName":"file2.txt", "mimeType":"text/plain", "checksum": {"@type": "SHA-1", "@value": "123789"}}]'

Updating File Metadata
~~~~~~~~~~~~~~~~~~~~~~

38 changes: 36 additions & 2 deletions doc/sphinx-guides/source/developers/s3-direct-upload-api.rst
@@ -122,7 +122,7 @@ To add multiple Uploaded Files to the Dataset
---------------------------------------------

Once the files exist in the s3 bucket, a final API call is needed to add all the files to the Dataset. In this API call, additional metadata is added using the "jsonData" parameter.
jsonData normally includes information such as a file description, tags, provenance, whether the file is restricted, etc. For direct uploads, the jsonData object must also include values for:
jsonData for this call is an array of objects that normally include information such as a file description, tags, provenance, whether the file is restricted, etc. For direct uploads, the jsonData object must also include values for:

* "description" - A description of the file
* "directoryLabel" - The "File Path" of the file, indicating which folder the file should be uploaded to within the dataset
@@ -154,7 +154,7 @@ Replacing an existing file in the Dataset
-----------------------------------------

Once the file exists in the s3 bucket, a final API call is needed to register it as a replacement of an existing file. This call is the same call used to replace a file in a Dataverse installation but, rather than sending the file bytes, additional metadata is added using the "jsonData" parameter.
jsonData normally includes information such as a file description, tags, provenance, whether the file is restricted, whether to allow the mimetype to change (forceReplace=true), etc. For direct uploads, the jsonData object must also include values for:
jsonData normally includes information such as a file description, tags, provenance, whether the file is restricted, whether to allow the mimetype to change (forceReplace=true), etc. For direct uploads, the jsonData object must include values for:

* "storageIdentifier" - String, as specified in prior calls
* "fileName" - String
@@ -178,3 +178,37 @@ Note that the API call does not validate that the file matches the hash value supplied

Note that this API call can be used independently of the others, e.g. supporting use cases in which the file already exists in S3/has been uploaded via some out-of-band method.
With current S3 stores, the object identifier must be in the correct bucket for the store, must include the PID authority/identifier of the parent dataset, and must be guaranteed unique; the supplied storage identifier must be prefaced with the store identifier used in the Dataverse installation, as with the internally generated examples above.
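
As a minimal sketch, assuming the replacement bytes are already in the store and using placeholder values (``FILE_ID`` is the database id of the file being replaced), such a call can look like:

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export FILE_ID=10

curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/files/$FILE_ID/replace" -F jsonData='{"storageIdentifier":"s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42", "fileName":"file1.txt", "mimeType":"text/plain", "checksum": {"@type": "SHA-1", "@value": "123456"}}'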

Replacing multiple existing files in the Dataset
------------------------------------------------

Once the replacement files exist in the s3 bucket, a final API call is needed to register them as replacements for existing files. In this API call, additional metadata is added using the "jsonData" parameter.
jsonData for this call is an array of objects that normally include information such as a file description, tags, provenance, whether the file is restricted, etc. For direct uploads, each object must also include some additional values:

* "fileToReplaceId" - the id of the file being replaced
* "forceReplace" - whether to replace a file with one of a different mimetype (optional, default is false)
* "description" - A description of the file
* "directoryLabel" - The "File Path" of the file, indicating which folder the file should be uploaded to within the dataset
* "storageIdentifier" - String
* "fileName" - String
* "mimeType" - String
* "fixity/checksum" either:

* "md5Hash" - String with MD5 hash value, or
* "checksum" - Json Object with "@type" field specifying the algorithm used and "@value" field with the value from that algorithm, both Strings


The allowed checksum algorithms are defined by the edu.harvard.iq.dataverse.DataFile.CheckSumType class and currently include MD5, SHA-1, SHA-256, and SHA-512.

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_IDENTIFIER=doi:10.5072/FK2/7U7YBV
export JSON_DATA="[{'fileToReplaceId': 10, 'description':'My description.','directoryLabel':'data/subdir1','categories':['Data'], 'restrict':'false', 'storageIdentifier':'s3://demo-dataverse-bucket:176e28068b0-1c3f80357c42', 'fileName':'file1.txt', 'mimeType':'text/plain', 'checksum': {'@type': 'SHA-1', '@value': '123456'}}, \
{'fileToReplaceId': 10, 'forceReplace': true, 'description':'My description.','directoryLabel':'data/subdir1','categories':['Data'], 'restrict':'false', 'storageIdentifier':'s3://demo-dataverse-bucket:176e28068b0-1c3f80357d53', 'fileName':'file2.txt', 'mimeType':'text/plain', 'checksum': {'@type': 'SHA-1', '@value': '123789'}}]"

curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/replaceFiles?persistentId=$PERSISTENT_IDENTIFIER" -F "jsonData=$JSON_DATA"

Note that this API call can be used independently of the others, e.g. supporting use cases in which the files already exist in S3/have been uploaded via some out-of-band method.
With current S3 stores, the object identifiers must be in the correct bucket for the store, must include the PID authority/identifier of the parent dataset, and must be guaranteed unique; the supplied storage identifiers must be prefaced with the store identifier used in the Dataverse installation, as with the internally generated examples above.
@@ -1544,6 +1544,10 @@ public void finalizeFileDelete(Long dataFileId, String storageLocation) throws IOException
throw new IOException("Attempted to permanently delete a physical file still associated with an existing DvObject "
+ "(id: " + dataFileId + ", location: " + storageLocation);
}
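// A physical file cannot be deleted without a storage location; fail fast rather than passing a null/blank location to direct storage access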
if(storageLocation == null || storageLocation.isBlank()) {
throw new IOException("Attempted to delete a physical file with no location "
+ "(id: " + dataFileId + ", location: " + storageLocation);
}
StorageIO<DvObject> directStorageAccess = DataAccess.getDirectStorageIO(storageLocation);
directStorageAccess.delete();
}
@@ -586,8 +586,7 @@ public String init() {
datafileService,
permissionService,
commandEngine,
systemConfig,
licenseServiceBean);
systemConfig);

fileReplacePageHelper = new FileReplacePageHelper(addReplaceFileHelper,
dataset,
77 changes: 73 additions & 4 deletions src/main/java/edu/harvard/iq/dataverse/api/Datasets.java
@@ -2452,8 +2452,7 @@ public Response addFileToDataset(@PathParam("id") String idSupplied,
fileService,
permissionSvc,
commandEngine,
systemConfig,
licenseSvc);
systemConfig);


//-------------------
@@ -3388,14 +3387,84 @@ public Response addFilesToDataset(@PathParam("id") String idSupplied,
this.fileService,
this.permissionSvc,
this.commandEngine,
this.systemConfig,
this.licenseSvc
this.systemConfig
);

return addFileHelper.addFiles(jsonData, dataset, authUser);

}

/**
* Replace multiple files in an existing Dataset
*
* @param idSupplied the id of the dataset
* @param jsonData a JSON array of objects describing the replacement files
* @return
*/
@POST
@Path("{id}/replaceFiles")
@Consumes(MediaType.MULTIPART_FORM_DATA)
public Response replaceFilesInDataset(@PathParam("id") String idSupplied,
@FormDataParam("jsonData") String jsonData) {

if (!systemConfig.isHTTPUpload()) {
return error(Response.Status.SERVICE_UNAVAILABLE, BundleUtil.getStringFromBundle("file.api.httpDisabled"));
}

// -------------------------------------
// (1) Get the user from the API key
// -------------------------------------
User authUser;
try {
authUser = findUserOrDie();
} catch (WrappedResponse ex) {
return error(Response.Status.FORBIDDEN, BundleUtil.getStringFromBundle("file.addreplace.error.auth")
);
}

// -------------------------------------
// (2) Get the Dataset Id
// -------------------------------------
Dataset dataset;

try {
dataset = findDatasetOrDie(idSupplied);
} catch (WrappedResponse wr) {
return wr.getResponse();
}

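// Log any existing locks on the dataset for troubleshooting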
dataset.getLocks().forEach(dl -> {
logger.info(dl.toString());
});

//------------------------------------
// (2a) Make sure dataset does not have package file
// --------------------------------------

for (DatasetVersion dv : dataset.getVersions()) {
if (dv.isHasPackageFile()) {
return error(Response.Status.FORBIDDEN,
BundleUtil.getStringFromBundle("file.api.alreadyHasPackageFile")
);
}
}

DataverseRequest dvRequest = createDataverseRequest(authUser);

AddReplaceFileHelper addFileHelper = new AddReplaceFileHelper(
dvRequest,
this.ingestService,
this.datasetService,
this.fileService,
this.permissionSvc,
this.commandEngine,
this.systemConfig
);

return addFileHelper.replaceFiles(jsonData, dataset, authUser);

}

/**
* API to find curation assignments and statuses
*
4 changes: 1 addition & 3 deletions src/main/java/edu/harvard/iq/dataverse/api/Files.java
@@ -235,7 +235,6 @@ public Response replaceFileInDataset(
if (null == contentDispositionHeader) {
if (optionalFileParams.hasStorageIdentifier()) {
newStorageIdentifier = optionalFileParams.getStorageIdentifier();
// ToDo - check that storageIdentifier is valid
if (optionalFileParams.hasFileName()) {
newFilename = optionalFileParams.getFileName();
if (optionalFileParams.hasMimetype()) {
@@ -261,8 +260,7 @@
this.fileService,
this.permissionSvc,
this.commandEngine,
this.systemConfig,
this.licenseSvc);
this.systemConfig);

// (5) Run "runReplaceFileByDatasetId"
long fileToReplaceId = 0;