
Support "holey" archival bags #12133

Open

qqmyers wants to merge 43 commits into IQSS:develop from GlobalDataverseCommunityConsortium:DANS-2157_holey_bags2

Conversation


qqmyers commented Jan 30, 2026

What this PR does / why we need it: Previously, Dataverse always archived a dataset as a single zipped bag whose size was limited only by the (compressed) size of the metadata and data. This PR adds support for the "holey" bag mechanism, in which some or all data files are not sent in the bag but are instead listed in a fetch.txt file. To complete the bag, the receiver must read the fetch file, retrieve the listed files from their URLs, and place them at the specified locations (path/filename).
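For illustration, fetch.txt follows the BagIt spec (RFC 8493): one line per missing payload file, giving its URL, its length in octets ("-" if unknown), and its path within the bag. The host, file ids, and paths below are hypothetical:

```
https://demo.dataverse.org/api/access/datafile/42 1048576 data/survey/responses.csv
https://demo.dataverse.org/api/access/datafile/43 -       data/survey/codebook.pdf
```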

Whether files are included is determined by two new JVM options:

dataverse.bagit.holey.max-file-size
dataverse.bagit.holey.max-data-size

which set, respectively, the largest (uncompressed) data file that may be included in the zip and the maximum aggregate (uncompressed) size that should be zipped.
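As a sketch of setting these (assuming the options follow the usual conventions for dataverse.* JVM options; the byte values are illustrative, e.g. 2 GiB per file and 10 GiB total):

```
./asadmin create-jvm-options '-Ddataverse.bagit.holey.max-file-size=2147483648'
./asadmin create-jvm-options '-Ddataverse.bagit.holey.max-data-size=10737418240'
```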
Files are now processed in order of increasing size, which means the zip will include the largest possible number of files when the max-data-size limit is in effect.
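A minimal, self-contained sketch of that selection logic (hypothetical names and limits; not the PR's actual BagGenerator code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the zip/fetch.txt split, not the PR's actual
// BagGenerator code; names and limits are illustrative.
public class HoleySplitSketch {
    record FileEntry(String path, long size) {}

    public static void main(String[] args) {
        long maxFileSize = 1_000_000;   // cf. dataverse.bagit.holey.max-file-size
        long maxDataSize = 1_500_000;   // cf. dataverse.bagit.holey.max-data-size
        List<FileEntry> files = new ArrayList<>(List.of(
                new FileEntry("data/big.tif", 5_000_000),
                new FileEntry("data/medium.csv", 900_000),
                new FileEntry("data/small.txt", 400_000)));

        files.sort(Comparator.comparingLong(FileEntry::size)); // smallest first
        long zipped = 0;
        for (FileEntry f : files) {
            // Divert a file to fetch.txt if it is itself too large or would
            // push the zip past the aggregate budget.
            if (f.size() > maxFileSize || zipped + f.size() > maxDataSize) {
                System.out.println("fetch.txt: " + f.path());
            } else {
                System.out.println("zip:       " + f.path());
                zipped += f.size();
            }
        }
    }
}
```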

Which issue(s) this PR closes:

  • Closes #
    DANS DD-2157

Special notes for your reviewer:
This builds on #12063. I'm working on another PR that will go beyond this one to offer another option: instead of using a fetch file and requiring the receiver to get the missing files, the files will be placed next to the zip file on the archiving service (local file, S3, Google, etc.). That gives size control without requiring an active/authenticated service to retrieve files from Dataverse. I'll update the docs and release note in that PR.

The internal BagGenerator was recently updated to include the gbrecs parameter, which suppresses download counts when the downloads are for archival purposes. This PR also adds that parameter to the URLs in the fetch.txt file to ensure that the third-party receiver doesn't accidentally trigger download counts.
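For example, a fetch.txt entry would carry the parameter like this (hypothetical host, file id, and length):

```
https://demo.dataverse.org/api/access/datafile/42?gbrecs=true 1048576 data/responses.csv
```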

In its first iteration, this PR assumes that the receiver will retrieve the missing files directly from Dataverse, which means it may need an API key or other credential to get those files. The next step will be to add an option for those files to be transferred separately to the same place the bag is sent, and to adjust the fetch file accordingly. This will ensure that, with holey bags, completion of the archiving by Dataverse means that all data is in the receiving service.

Suggestions on how to test this: Set up archiving, set either or both of the settings above, and verify that the split between files included in the zip and files listed in fetch.txt is correct.
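One quick way to inspect the result (assuming a file-system archiver; the bag path and name are illustrative):

```
# print fetch.txt from the generated bag zip
unzip -p /srv/archive/my-dataset-v1.0.zip '*/fetch.txt'
```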

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

qqmyers and others added 30 commits December 6, 2025 18:26
Spec doesn't allow empty lines; dropping whitespace-only lines seems reasonable as well (users can't see from the Dataverse display whether an empty line would appear in bag-info.txt or not if we allow whitespace-only lines, or whitespace beyond the 78-char wrap limit)
affects manifest and pid-mapping files as well as data file placement
Added unit tests for multilineWrap
@qqmyers qqmyers marked this pull request as ready for review February 4, 2026 15:51
@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label Feb 4, 2026
@qqmyers qqmyers added this to the 6.10 milestone Feb 4, 2026