Commits (48)
c9f728b
add checksum URI values and methods
qqmyers Dec 6, 2025
a25e47b
update version and use checksum URIs
qqmyers Dec 6, 2025
6c0cb49
handle multiline descriptions and org names
qqmyers Dec 6, 2025
7a34db8
drop blank lines in multiline values
qqmyers Dec 9, 2025
b0daad7
remove title as a folder
qqmyers Dec 9, 2025
e5457a8
handle null deaccession reason
qqmyers Dec 9, 2025
10b0556
use static to simplify testing
qqmyers Dec 10, 2025
d6cf1e2
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 10, 2025
6d24185
Sanitize/split multiline catalog entry, add Dataverse-Bag-Version
qqmyers Dec 10, 2025
c4daf28
Added unit tests for multilineWrap
janvanmansum Dec 11, 2025
e76bc91
Removed unnecessary repeat helper method
janvanmansum Dec 11, 2025
108c912
Alined test names with actual test being done
janvanmansum Dec 11, 2025
62ea9d9
Merge pull request #48 from janvanmansum/OREBag1.0.2-amend
qqmyers Dec 11, 2025
884b81b
DD-2098 - allow archivalstatus calls on deaccessioned versions
qqmyers Dec 16, 2025
5e4e90a
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 16, 2025
3076d69
set array properly
qqmyers Dec 17, 2025
cbdc15f
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 19, 2025
1a7dafa
DD-2212 - use configured checksum when no files are present
qqmyers Dec 19, 2025
7eea57c
Revert "DD-2098 - allow archivalstatus calls on deaccessioned versions"
qqmyers Dec 19, 2025
2477cf9
add Source-Org as a potential multiline case, remove change to Int Id
qqmyers Dec 19, 2025
3f3908f
release note
qqmyers Dec 19, 2025
aa44c08
use constants, pass labelLength to wrapping, start custom lineWrap
qqmyers Dec 19, 2025
8227edf
update to handle overall 79 char length
qqmyers Dec 19, 2025
d0749fc
wrap any other potentially long values
qqmyers Dec 19, 2025
24a625f
cleanup deprecated code, auto-gen comments
qqmyers Dec 19, 2025
bf036f3
update comment
qqmyers Dec 22, 2025
be65611
add tests
qqmyers Dec 22, 2025
2516cf4
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Dec 22, 2025
24d098a
QDR updates to apache 5, better fault tolerance for file retrieval
qqmyers Dec 22, 2025
b4a3799
release note update
qqmyers Dec 22, 2025
85a5239
Merge branch 'develop' into OREBag1.0.2
qqmyers Jan 16, 2026
e461415
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Jan 28, 2026
1b42978
suppress counting file retrieval to bag as a download in gb table
qqmyers Jan 28, 2026
56de8cb
Merge branch 'OREBag1.0.2' of https://github.com/GlobalDataverseCommu…
qqmyers Jan 28, 2026
3083179
Merge remote-tracking branch 'IQSS/develop' into OREBag1.0.2
qqmyers Jan 29, 2026
49f4818
basic fetch
qqmyers Jan 30, 2026
7f5179f
order by file size
qqmyers Jan 30, 2026
bc63285
only add subcollection folders (if they exist)
qqmyers Jan 30, 2026
59f3a2a
replace deprecated constructs
qqmyers Jan 30, 2026
69c9a0d
restore name collision check
qqmyers Jan 30, 2026
422435a
add null check to quiet log/avoid exception
qqmyers Jan 30, 2026
d9cfe1d
cleanup - checksum change
qqmyers Jan 30, 2026
4895f80
cleanup, suppress downloads with gbrec for fetch file
qqmyers Jan 30, 2026
62a03b2
add setting, refactor, for non-holey option
qqmyers Feb 1, 2026
637b2e3
Update to track non-zipped files, add method
qqmyers Feb 4, 2026
a6b0505
reuse stream supplier, update archivers to send oversized files
qqmyers Feb 4, 2026
5739e35
docs, release note update
qqmyers Feb 4, 2026
5c82ab8
style fix
qqmyers Feb 4, 2026
15 changes: 15 additions & 0 deletions doc/release-notes/12063-ORE-and-Bag-updates.md
@@ -0,0 +1,15 @@
This release contains multiple updates to the OAI-ORE metadata export and archival Bag output:

OAI-ORE
- now uses URI for checksum algorithms
- a bug causing failures with deaccessioned versions when the deaccession note ("Deaccession Reason" in the UI) was null (which is allowed via the API) has been fixed
- the "https://schema.org/additionalType" value is updated to "Dataverse OREMap Format v1.0.2" to indicate that the output has changed

Archival Bag
- for dataset versions with no files, the (empty) manifest-<alg>.txt file created will now use the default algorithm defined by the "FileFixityChecksumAlgorithm" setting rather than always defaulting to "md5"
- a bug causing the bag-info.txt to not have information on contacts when the dataset version has more than one contact has been fixed
- values used in the bag-info.txt file that may be multi-line (with embedded CR or LF characters) are now properly indented/formatted per the BagIt specification (i.e. Internal-Sender-Identifier, External-Description, Source-Organization, Organization-Address); see the wrapping example below
- the name of the dataset is no longer used as a subdirectory under the data directory (dataset names can be long enough to cause failures when unzipping)
- a new key, "Dataverse-Bag-Version", has been added to bag-info.txt with a value of "1.0", allowing tracking of changes to Dataverse's archival bag generation
- file retrieval has been improved with respect to retries on errors or throttling
- retrieval of files for inclusion in the bag is no longer counted as a download by Dataverse
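
For illustration, the BagIt specification (RFC 8493) recommends that lines in bag-info.txt not exceed 79 characters and allows a long value to be continued on subsequent lines that begin with whitespace; a wrapped entry therefore looks like this (text invented for the example):

    External-Description: A dataset description long enough to exceed the
     recommended 79-character line length is split across lines, with each
     continuation line indented per the BagIt specification.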
21 changes: 21 additions & 0 deletions doc/release-notes/12144-un-holey-bags.md
@@ -0,0 +1,21 @@
This release contains multiple updates to the OAI-ORE metadata export and archival Bag output:

OAI-ORE
- now uses URI for checksum algorithms
- a bug causing failures with deaccessioned versions when the deaccession note ("Deaccession Reason" in the UI) was null (which is allowed via the API) has been fixed
- the "https://schema.org/additionalType" value is updated to "Dataverse OREMap Format v1.0.2" to indicate that the output has changed

Archival Bag
- for dataset versions with no files, the (empty) manifest-<alg>.txt file created will now use the default algorithm defined by the "FileFixityChecksumAlgorithm" setting rather than always defaulting to "md5"
- a bug causing the bag-info.txt to not have information on contacts when the dataset version has more than one contact has been fixed
- values used in the bag-info.txt file that may be multi-line (with embedded CR or LF characters) are now properly indented/formatted per the BagIt specification (i.e. Internal-Sender-Identifier, External-Description, Source-Organization, Organization-Address).
- the name of the dataset is no longer used as a subdirectory under the data directory (dataset names can be long enough to cause failures when unzipping)
- a new key, "Dataverse-Bag-Version", has been added to bag-info.txt with a value of "1.0", allowing tracking of changes to Dataverse's archival bag generation
- file retrieval has been improved with respect to retries on errors or throttling
- retrieval of files for inclusion in the bag is no longer counted as a download by Dataverse
- the size of data files and the total dataset size included in an archival bag can now be limited. Admins can choose whether files above these limits are transferred along with the zipped bag (creating a complete archival copy) or are only referenced (using the concept of a "holey" bag, which lists the oversized files and the Dataverse URLs from which they can be retrieved in a fetch.txt file). In the holey-bag case, an active service on the archiving platform must retrieve the oversized files (using appropriate credentials as needed) to make a complete copy

### New JVM Options (MicroProfile Config Settings)
- dataverse.bagit.zip.holey
- dataverse.bagit.zip.max-data-size
- dataverse.bagit.zip.max-file-size
1 change: 1 addition & 0 deletions doc/sphinx-guides/source/admin/big-data-administration.rst
@@ -302,6 +302,7 @@ There are a broad range of options (that are not turned on by default) for impro
- :ref:`:DisableSolrFacetsWithoutJsession` - disables facets for users who have disabled cookies (e.g. for bots)
- :ref:`:DisableUncheckedTypesFacet` - only disables the facet showing the number of collections, datasets, files matching the query (this facet is potentially less useful than others)
- :ref:`:StoreIngestedTabularFilesWithVarHeaders` - by default, Dataverse stores ingested files without headers and dynamically adds them back at download time. Once this setting is enabled, Dataverse will leave the headers in place (for newly ingested files), reducing the cost of downloads
- :ref:`dataverse.bagit.zip.max-file-size`, :ref:`dataverse.bagit.zip.max-data-size`, and :ref:`dataverse.bagit.zip.holey` - options to control the size and temporary storage requirements when generating archival Bags - see :ref:`BagIt Export`


Scaling Infrastructure
17 changes: 17 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -2259,6 +2259,8 @@ These archival Bags include all of the files and metadata in a given dataset ver

The Dataverse Software offers an internal archive workflow which may be configured as a PostPublication workflow via an admin API call to manually submit previously published Datasets and prior versions to a configured archive such as Chronopolis. The workflow creates a `JSON-LD <http://www.openarchives.org/ore/0.9/jsonld>`_ serialized `OAI-ORE <https://www.openarchives.org/ore/>`_ map file, which is also available as a metadata export format in the Dataverse Software web interface.

The size of the zipped archival Bag can now be limited for all archivers. Files that do not fit within the limit can either be transferred separately (placed so that they are correctly positioned according to the BagIt specification when the zipped bag is unzipped in place) or just referenced for later download (using the BagIt concept of a 'holey' bag, with the list of missing files in a ``fetch.txt`` file). These settings allow large datasets to be managed by excluding files over a certain size or total data size, which can be useful for archivers with size limitations or to reduce transfer times. See the :ref:`dataverse.bagit.zip.max-file-size`, :ref:`dataverse.bagit.zip.max-data-size`, and :ref:`dataverse.bagit.zip.holey` JVM options for more details.

At present, archiving classes include the DuraCloudSubmitToArchiveCommand, LocalSubmitToArchiveCommand, GoogleCloudSubmitToArchive, and S3SubmitToArchiveCommand, which all extend the AbstractSubmitToArchiveCommand and use the configurable mechanisms discussed below. (A DRSSubmitToArchiveCommand, which works with Harvard's DRS, also exists and, while specific to DRS, is a useful example of how archivers can support single-version-only semantics and archive only from specified collections, with collection-specific parameters.)

All current options support the :ref:`Archival Status API` calls and the same status is available in the dataset page version table (for contributors/those who could view the unpublished dataset, with more detail available to superusers).
@@ -3868,6 +3870,21 @@ This can instead be restricted to only superusers who can publish the dataset us

Example: ``dataverse.coar-notify.relationship-announcement.notify-superusers-only=true``

.. _dataverse.bagit.zip.holey:

``dataverse.bagit.zip.holey``
A boolean that, if true, will cause the BagIt archiver to create a "holey" bag. In a holey bag, files that are not included in the bag are listed in the ``fetch.txt`` file with a URL from which they can be downloaded. This is used in conjunction with ``dataverse.bagit.zip.max-file-size`` and/or ``dataverse.bagit.zip.max-data-size``. Default: false.
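
Each line of ``fetch.txt`` follows the BagIt fetch format of URL, length in bytes (or ``-`` if unknown), and path within the bag, for example (the URL and values here are illustrative)::

    https://demo.dataverse.org/api/access/datafile/42 3921875000 data/bigfile.nc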

.. _dataverse.bagit.zip.max-data-size:

``dataverse.bagit.zip.max-data-size``
The maximum total (uncompressed) size of data files (in bytes) to include in a BagIt zip archive. If the total size of the dataset files exceeds this limit, files will be excluded from the zipped bag (starting with the largest) until the total size is under the limit. Excluded files are handled as defined by ``dataverse.bagit.zip.holey``: they are only listed in ``fetch.txt`` if that setting is true, or transferred separately and placed next to the zipped bag otherwise. When not set, there is no limit.

.. _dataverse.bagit.zip.max-file-size:

``dataverse.bagit.zip.max-file-size``
The maximum (uncompressed) size of a single file (in bytes) to include in a BagIt zip archive. Any file larger than this will be excluded from the zipped bag. Excluded files are handled as defined by ``dataverse.bagit.zip.holey``: they are only listed in ``fetch.txt`` if that setting is true, or transferred separately and placed next to the zipped bag otherwise. When not set, there is no limit.
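
As with other JVM options, these settings can be provided via any MicroProfile Config mechanism; for example (values are illustrative), with Payara's ``asadmin``::

    ./asadmin create-jvm-options "-Ddataverse.bagit.zip.max-file-size=5000000000"
    ./asadmin create-jvm-options "-Ddataverse.bagit.zip.holey=true"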

.. _feature-flags:

Feature Flags
33 changes: 27 additions & 6 deletions src/main/java/edu/harvard/iq/dataverse/DataFile.java
@@ -109,18 +109,22 @@ public class DataFile extends DvObject implements Comparable {
* The list of types should be limited to the list above in the technote
* because the string gets passed into MessageDigest.getInstance() and you
* can't just pass in any old string.
*
* The URIs are used in the OAI_ORE export. They are taken from the associated XML Digital Signature standards.
*/
public enum ChecksumType {

MD5("MD5"),
SHA1("SHA-1"),
SHA256("SHA-256"),
SHA512("SHA-512");
MD5("MD5", "http://www.w3.org/2001/04/xmldsig-more#md5"),
SHA1("SHA-1", "http://www.w3.org/2000/09/xmldsig#sha1"),
SHA256("SHA-256", "http://www.w3.org/2001/04/xmlenc#sha256"),
SHA512("SHA-512", "http://www.w3.org/2001/04/xmlenc#sha512");

private final String text;
private final String uri;

private ChecksumType(final String text) {
private ChecksumType(final String text, final String uri) {
this.text = text;
this.uri = uri;
}

public static ChecksumType fromString(String text) {
@@ -131,13 +135,30 @@ public static ChecksumType fromString(String text) {
}
}
}
throw new IllegalArgumentException("ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + ".");
throw new IllegalArgumentException(
"ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + ".");
}

public static ChecksumType fromUri(String uri) {
if (uri != null) {
for (ChecksumType checksumType : ChecksumType.values()) {
if (uri.equals(checksumType.uri)) {
return checksumType;
}
}
}
throw new IllegalArgumentException(
"ChecksumType must be one of these values: " + Arrays.asList(ChecksumType.values()) + ".");
}

@Override
public String toString() {
return text;
}

public String toUri() {
return uri;
}
}

//@Expose
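
As a usage sketch (a hypothetical caller, not code from this PR), the new methods map between a checksum algorithm and the XML Digital Signature URI used in the OAI-ORE export:

import edu.harvard.iq.dataverse.DataFile.ChecksumType;

public class ChecksumUriExample {
    public static void main(String[] args) {
        // Look up the enum value by its display name, as stored with a DataFile
        ChecksumType type = ChecksumType.fromString("SHA-256");
        // The URI form now written into the OAI-ORE export
        String uri = type.toUri(); // http://www.w3.org/2001/04/xmlenc#sha256
        // Map an ORE checksum URI back to the enum (throws IllegalArgumentException if unknown)
        ChecksumType roundTrip = ChecksumType.fromUri(uri);
        System.out.println(type + " <-> " + uri + " <-> " + roundTrip);
    }
}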
@@ -7,16 +7,24 @@
import edu.harvard.iq.dataverse.authorization.users.ApiToken;
import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
import edu.harvard.iq.dataverse.engine.command.RequiredPermissions;
import edu.harvard.iq.dataverse.util.bagit.BagGenerator;
import edu.harvard.iq.dataverse.util.bagit.OREMap;

import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.DuraCloudContext;
import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.DuraCloudHost;
import static edu.harvard.iq.dataverse.settings.SettingsServiceBean.Key.DuraCloudPort;
import edu.harvard.iq.dataverse.workflow.step.Failure;
import edu.harvard.iq.dataverse.workflow.step.WorkflowStepResult;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
@@ -96,6 +104,8 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t
statusObject.add(DatasetVersion.ARCHIVAL_STATUS, DatasetVersion.ARCHIVAL_STATUS_FAILURE);
statusObject.add(DatasetVersion.ARCHIVAL_STATUS_MESSAGE, "Bag not transferred");

Path tempBagFile = null;

try {
/*
* If there is a failure in creating a space, it is likely that a prior version
@@ -161,20 +171,38 @@ public void run() {
// Add BagIt ZIP file
// Although DuraCloud uses SHA-256 internally, its API uses MD5 to verify the
// transfer
Path bagFile = null;


messageDigest = MessageDigest.getInstance("MD5");
try (PipedInputStream in = new PipedInputStream(100000);
DigestInputStream digestInputStream2 = new DigestInputStream(in, messageDigest)) {
Thread bagThread = startBagThread(dv, in, digestInputStream2, dataciteXml, token);
checksum = store.addContent(spaceName, fileName, digestInputStream2, -1l, null, null, null);
bagThread.join();
if (success) {
logger.fine("Content: " + fileName + " added with checksum: " + checksum);
localchecksum = Hex.encodeHexString(digestInputStream2.getMessageDigest().digest());
tempBagFile = Files.createTempFile("dataverse-bag-", ".zip");
logger.fine("Creating bag in temporary file: " + tempBagFile.toString());

BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml);
bagger.setAuthenticationKey(token.getTokenString());
// Generate bag to temporary file using the provided ore JsonObject
try (FileOutputStream fos = new FileOutputStream(tempBagFile.toFile())) {
if (!bagger.generateBag(fos)) {
throw new IOException("Bag generation failed");
}
if (!success || !checksum.equals(localchecksum)) {
}

// Store BagIt file
long bagSize = Files.size(tempBagFile);
logger.fine("Bag created successfully, size: " + bagSize + " bytes");

// Now upload the bag file
messageDigest = MessageDigest.getInstance("MD5");
try (InputStream is = Files.newInputStream(bagFile);
DigestInputStream bagDigestInputStream = new DigestInputStream(is, messageDigest)) {
checksum = store.addContent(spaceName, fileName, bagDigestInputStream, bagFile.toFile().length(), "application/zip", null, null);
localchecksum = Hex.encodeHexString(bagDigestInputStream.getMessageDigest().digest());

if (checksum != null && checksum.equals(localchecksum)) {
logger.fine("Content: " + fileName + " added with checksum: " + checksum);
success = true;
} else {
logger.severe("Failure on " + fileName);
logger.severe(success ? checksum + " not equal to " + localchecksum : "failed to transfer to DuraCloud");
logger.severe(checksum + " not equal to " + localchecksum);
try {
store.deleteContent(spaceName, fileName);
store.deleteContent(spaceName, baseFileName + "_datacite.xml");
@@ -185,9 +213,6 @@
"DuraCloud Submission Failure: incomplete archive transfer");
}
}

logger.fine("DuraCloud Submission step: Content Transferred");

// Document the location of dataset archival copy location (actually the URL
// where you can
// view it as an admin)
@@ -223,8 +248,20 @@ public void run() {
return new Failure("Unable to create DuraCloud space with name: " + baseFileName, mesg);
} catch (NoSuchAlgorithmException e) {
logger.severe("MD5 MessageDigest not available!");
} catch (Exception e) {
logger.warning(e.getLocalizedMessage());
e.printStackTrace();
return new Failure("Error in transferring file to DuraCloud",
"DuraCloud Submission Failure: internal error");
}
finally {
if (tempBagFile != null) {
try {
Files.deleteIfExists(tempBagFile);
} catch (IOException e) {
logger.warning("Failed to delete temporary bag file: " + tempBagFile + " : " + e.getMessage());
}
}
dv.setArchivalCopyLocation(statusObject.build().toString());
}
} else {
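
The diff above interleaves removed and added lines, so the new DuraCloud upload flow can be hard to follow: the bag is now generated to a temporary file, streamed to DuraCloud while a local MD5 digest is computed, verified against the checksum DuraCloud returns, and the temporary file is removed. A simplified sketch follows (the ContentStore type and method calls mirror the surrounding code; the helper method and its signature are assumed for illustration):

// Sketch only, not the PR's exact code: generate the bag to a temp file, then
// upload it to DuraCloud while computing a local MD5 to compare with the
// checksum DuraCloud returns.
private boolean uploadBag(ContentStore store, String spaceName, String fileName,
        DatasetVersion dv, String dataciteXml, ApiToken token) throws Exception {
    Path tempBagFile = Files.createTempFile("dataverse-bag-", ".zip");
    try {
        BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml);
        bagger.setAuthenticationKey(token.getTokenString());
        try (FileOutputStream fos = new FileOutputStream(tempBagFile.toFile())) {
            if (!bagger.generateBag(fos)) {
                throw new IOException("Bag generation failed");
            }
        }
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream is = Files.newInputStream(tempBagFile);
                DigestInputStream digestStream = new DigestInputStream(is, md)) {
            String remoteChecksum = store.addContent(spaceName, fileName, digestStream,
                    Files.size(tempBagFile), "application/zip", null, null);
            String localChecksum = Hex.encodeHexString(digestStream.getMessageDigest().digest());
            return remoteChecksum != null && remoteChecksum.equals(localChecksum);
        }
    } finally {
        Files.deleteIfExists(tempBagFile);
    }
}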