
Support "holey" archival bags #12133

Open

qqmyers wants to merge 43 commits into IQSS:develop from GlobalDataverseCommunityConsortium:DANS-2157_holey_bags2

Conversation


qqmyers commented Jan 30, 2026

What this PR does / why we need it: Previously, Dataverse always archived a dataset as a single zipped bag whose size was limited only by the (compressed) size of the metadata and data. This PR adds support for the "holey" bag mechanism, in which some or all data files are not sent in the bag but are instead listed in a fetch.txt file. To complete the bag, the receiver must read the fetch file, retrieve the listed files from their URLs, and place them at the specified locations (path/filename).
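For illustration, fetch.txt follows the BagIt spec (RFC 8493): one line per missing payload file, giving its URL, its length in octets ("-" if unknown), and its path within the bag. The host, file ids, and paths below are hypothetical:

```
https://demo.dataverse.org/api/access/datafile/42 1048576 data/survey/responses.csv
https://demo.dataverse.org/api/access/datafile/43 -       data/survey/codebook.pdf
```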

Whether files are included is determined by two new JVM options:

dataverse.bagit.holey.max-file-size
dataverse.bagit.holey.max-data-size

which set, respectively, the largest (uncompressed) data file that may be included in the zip and the maximum aggregate (uncompressed) size that should be zipped.
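As a sketch of setting these (assuming the options follow the usual conventions for dataverse.* JVM options; the byte values are illustrative, e.g. 2 GiB per file and 10 GiB total):

```
./asadmin create-jvm-options '-Ddataverse.bagit.holey.max-file-size=2147483648'
./asadmin create-jvm-options '-Ddataverse.bagit.holey.max-data-size=10737418240'
```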
Files are now processed in order of increasing size, which means the zip will include the largest possible number of files when the max-data-size limit is in effect.
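A minimal, self-contained sketch of that selection logic (hypothetical names and limits; not the PR's actual BagGenerator code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the zip/fetch.txt split, not the PR's actual
// BagGenerator code; names and limits are illustrative.
public class HoleySplitSketch {
    record FileEntry(String path, long size) {}

    public static void main(String[] args) {
        long maxFileSize = 1_000_000;   // cf. dataverse.bagit.holey.max-file-size
        long maxDataSize = 1_500_000;   // cf. dataverse.bagit.holey.max-data-size
        List<FileEntry> files = new ArrayList<>(List.of(
                new FileEntry("data/big.tif", 5_000_000),
                new FileEntry("data/medium.csv", 900_000),
                new FileEntry("data/small.txt", 400_000)));

        files.sort(Comparator.comparingLong(FileEntry::size)); // smallest first
        long zipped = 0;
        for (FileEntry f : files) {
            // Divert a file to fetch.txt if it is itself too large or would
            // push the zip past the aggregate budget.
            if (f.size() > maxFileSize || zipped + f.size() > maxDataSize) {
                System.out.println("fetch.txt: " + f.path());
            } else {
                System.out.println("zip:       " + f.path());
                zipped += f.size();
            }
        }
    }
}
```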

Which issue(s) this PR closes:

  • Closes #
    DANS DD-2157

Special notes for your reviewer:
This builds on #12063. I'm working on another PR that will go beyond this one to offer another option: instead of using a fetch file and requiring the receiver to get the missing files, the files will be placed next to the zip file on the archiving service (local file, S3, Google, etc.). That gives size control without requiring an active/authenticated service to retrieve files from Dataverse. I'll update the docs and release note in that PR.

The internal BagGenerator was recently updated to include the gbrecs parameter, which suppresses download counts when the downloads are for archival purposes. This PR also adds that parameter to the URLs in the fetch.txt file to ensure that the third-party receiver doesn't accidentally trigger download counts.
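For example, a fetch.txt entry would carry the parameter like this (hypothetical host, file id, and length):

```
https://demo.dataverse.org/api/access/datafile/42?gbrecs=true 1048576 data/responses.csv
```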

In its first iteration, this PR assumes that the receiver will retrieve the missing files directly from Dataverse, which means it may need an API key or other credential to get those files. The next step will be to add an option for those files to be transferred separately to the same place the bag is sent, and to adjust the fetch file accordingly. This will ensure that, with holey bags, completion of the archiving by Dataverse means that all data is in the receiving service.

Suggestions on how to test this: Set up archiving, set either or both of the settings above, and verify that the split between files included in the zip and files listed in fetch.txt is correct.
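One quick way to inspect the result (assuming a file-system archiver; the bag path and name are illustrative):

```
# print fetch.txt from the generated bag zip
unzip -p /srv/archive/my-dataset-v1.0.zip '*/fetch.txt'
```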

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

qqmyers and others added 30 commits December 6, 2025 18:26
Spec doesn't allow empty lines; dropping whitespace-only lines seems reasonable as well (users can't see from the Dataverse display whether an empty line would appear in bag-info.txt or not if we allow whitespace-only lines, or whitespace beyond the 78-char wrap limit)
affects manifest and pid-mapping files as well as data file placement
Added unit tests for multilineWrap
@qqmyers qqmyers marked this pull request as ready for review February 4, 2026 15:51
@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label Feb 4, 2026
@qqmyers qqmyers added this to the 6.10 milestone Feb 4, 2026