S3 Storage Integration
CLARIN-DSpace supports storing bitstreams in S3-compatible object storage.
When enabled, bitstreams can be uploaded directly into S3 and downloaded by clients using presigned URLs or standard download.
- This allows efficient and direct file transfer from S3 to the client.
- Download statistics are preserved (requests still pass through CLARIN-DSpace before the presigned URL is issued).
- Presigned URLs can be configured with a short lifetime, since downloads are expected to start immediately.
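The flow described above (record the download first, then hand out a short-lived URL) can be sketched as follows. This is a minimal illustration, not CLARIN-DSpace code: `issue_download_url`, `is_valid`, the in-memory `download_stats` counter, the endpoint, and the secret are all hypothetical, and the HMAC signature is a toy stand-in for the AWS Signature Version 4 scheme that real S3 presigned URLs use.

```python
import hashlib
import hmac
import time
from collections import Counter
from typing import Optional
from urllib.parse import urlencode

# Hypothetical values, for illustration only.
SECRET = b"demo-secret"
ENDPOINT = "https://s3.example.org/your-bucket-name"

download_stats = Counter()  # stand-in for the repository's download statistics


def sign(key: str, expires: int) -> str:
    """Toy signature over object key and expiry (NOT AWS Signature V4)."""
    msg = f"{key}:{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def issue_download_url(key: str, ttl_seconds: int = 60,
                       now: Optional[float] = None) -> str:
    """Count the download, then return a signed URL that expires soon."""
    now = time.time() if now is None else now
    download_stats[key] += 1          # the request passed through the repository
    expires = int(now) + ttl_seconds  # short lifetime: the download starts at once
    query = urlencode({"expires": expires, "signature": sign(key, expires)})
    return f"{ENDPOINT}/{key}?{query}"


def is_valid(key: str, expires: int, signature: str,
             now: Optional[float] = None) -> bool:
    """Storage-side check: the URL is unexpired and the signature matches."""
    now = time.time() if now is None else now
    return now <= expires and hmac.compare_digest(signature, sign(key, expires))
```

Because the statistic is recorded before the URL is returned, the client can fetch the bytes straight from the object store without the repository losing track of the download.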
S3 is configured through environment variables in the .env file or through the DSpace configuration files.
Example of additional .env configuration:
S3_STORAGE=1
S3_ENABLED=true
S3_RELATIVE_PATH=true
S3_BUCKET=your-bucket-name
S3_SUBFOLDER=assetstore
S3_ACCESS=XXXX
S3_SECRET=XXXXXX
S3_REGION_NAME=
S3_PATH_STYLE_ACCESS=false
S3_ENDPOINT=your-endpoint
Enable direct S3 download via presigned URLs:
s3.download.direct.enabled = true
Enable/disable keeping a local copy alongside S3:
sync.storage.service.enabled = true
- true → new uploads are stored in both the local assetstore and the S3 bucket.
  Bitstreams get store_number=77 in db. Downloads still come from the primary store (assetstore.index.primary).
  ⚠️ If S3 is down or the file is missing, the local copy is not automatically used.
- false → new uploads are stored only in the primary store.
  If the primary is S3, nothing is written locally. If the primary is local, S3 is not used at all.
Examples:
- New repo, sync=true → files stored in /assetstore/ and S3.
- New repo, sync=false with S3 primary → files go only to S3.
- Existing repo enabling S3 later, sync=false → old files remain local, new ones go only to S3.
- Existing repo enabling S3 later, sync=true → old files remain local, new ones go to both local and S3.
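The upload rule behind these examples can be written down as a small function. This is a sketch of the behavior as described in this section, not the actual implementation; `upload_targets` and the string labels `"local"` and `"s3"` are hypothetical stand-ins for the numeric `assetstore.index.primary` setting.

```python
def upload_targets(primary: str, sync_enabled: bool) -> set:
    """Return the stores that receive a NEW upload.

    primary      -- "local" or "s3" (readable stand-in for assetstore.index.primary)
    sync_enabled -- value of sync.storage.service.enabled
    """
    if primary not in ("local", "s3"):
        raise ValueError(f"unknown primary store: {primary!r}")
    if sync_enabled:
        # Mirrored: the upload is written to both stores.
        return {"local", "s3"}
    # Not mirrored: only the primary store is written.
    return {primary}


# The cases from the list above; files uploaded earlier are never moved.
assert upload_targets("s3", sync_enabled=True) == {"local", "s3"}
assert upload_targets("s3", sync_enabled=False) == {"s3"}
assert upload_targets("local", sync_enabled=False) == {"local"}
```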
Enable/disable multipart uploads to S3 (default: false):
s3.upload.by.parts.enabled = false
The storage behavior depends on four factors:
- bitstream store_number from db
- assetstore.s3.enabled (on/off switch)
- assetstore.index.primary (which storage is primary for downloads)
- sync.storage.service.enabled (whether bitstreams are mirrored in both local + S3)
- If s3.download.direct.enabled = true, files are downloaded directly from S3 using a presigned URL.
- If sync.storage.service.enabled = true, files are stored both in S3 and local storage.
- Downloads always happen from the primary storage (assetstore.index.primary). ⚠️ If a file is missing in S3 but present locally, there is no automatic fallback to local storage.
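The download rules above, including the missing fallback, can be modeled as follows. This is a simplified sketch under stated assumptions: `Bitstream`, `resolve_download`, and the returned labels are hypothetical names, and the model assumes that direct download only applies when S3 is the primary store.

```python
from dataclasses import dataclass


@dataclass
class Bitstream:
    in_local: bool  # a copy exists in the local assetstore
    in_s3: bool     # a copy exists in the S3 bucket


def resolve_download(bs: Bitstream, primary: str, direct_enabled: bool,
                     s3_available: bool = True) -> str:
    """Return how a download is served, or raise if the primary cannot serve it."""
    if primary == "s3":
        if not s3_available or not bs.in_s3:
            # No fallback: the local copy is not consulted even if it exists.
            raise FileNotFoundError("bitstream not available in primary store (s3)")
        # s3.download.direct.enabled decides between presigned URL and standard download.
        return "presigned-url" if direct_enabled else "standard-download"
    if primary == "local":
        if not bs.in_local:
            raise FileNotFoundError("bitstream not available in primary store (local)")
        return "standard-download"
    raise ValueError(f"unknown primary store: {primary!r}")
```

For example, `resolve_download(Bitstream(in_local=True, in_s3=False), "s3", True)` raises `FileNotFoundError` even though a local copy exists, which is exactly the situation the warning describes.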
- 1.1 Sync enabled → Files stored in both S3 and local assetstore.
- 1.2 Sync disabled → Files stored only in S3.
- 2.1 Replace local storage
  - Copy existing assetstore to S3.
  - Update configuration to set S3 as primary (assetstore.index.primary=1).
  - Ensure bitstream store_number values in db match the new storage setup.
- 2.2 Extend with S3
  - 2.2.1 No sync → Old files remain local, new files go to S3.
  - 2.2.2 With sync → New uploads are stored in both local + S3.
    (But downloads still always come from primary, i.e., S3.)
    ⚠️ Note: Bitstreams uploaded before S3 was enabled (with store_number=77 in db) may become inaccessible once S3 is set as primary. The system assumes they are in S3.
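Before switching the primary store, it can help to check which store_number values are actually present in the database. The sketch below uses an in-memory SQLite table as a stand-in for the real database; the table name `bitstream` and its columns `uuid` and `store_number` are assumptions for the example, and the helper names are hypothetical.

```python
import sqlite3


def count_by_store_number(conn) -> dict:
    """Return {store_number: number of bitstreams}."""
    rows = conn.execute(
        "SELECT store_number, COUNT(*) FROM bitstream GROUP BY store_number"
    )
    return dict(rows.fetchall())


def bitstreams_outside(conn, allowed) -> list:
    """Return the uuids of bitstreams whose store_number is not in `allowed`."""
    allowed = tuple(allowed)
    placeholders = ",".join("?" for _ in allowed)
    sql = (
        "SELECT uuid FROM bitstream "
        f"WHERE store_number NOT IN ({placeholders}) ORDER BY uuid"
    )
    return [row[0] for row in conn.execute(sql, allowed)]


# Demo data: one bitstream per store_number value mentioned in this section.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bitstream (uuid TEXT PRIMARY KEY, store_number INTEGER)")
conn.executemany(
    "INSERT INTO bitstream VALUES (?, ?)",
    [("a", 0), ("b", 1), ("c", 77)],
)

# Bitstreams that would need attention if only store 1 (S3) is expected.
needs_review = bitstreams_outside(conn, [1])
```

Any uuid returned by `bitstreams_outside` points at a bitstream whose recorded location differs from the expected setup and should be reviewed before the switch.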
- No fallback download
  - If S3 is down or a file is missing, downloads fail even if the file exists locally.
- Ambiguous store_number=77
  - Used when sync.storage.service.enabled=true.
  - Does not clearly identify the storage location (like local, S3 or other).
  - Suggested improvement: use explicit values (0=local, 1=S3, 2=other).
- Old bitstreams
  - Files uploaded before enabling S3 (with store_number=77 in db) may break when S3 is set as primary.
- Limited bucket configuration
  - The S3 storage integration does not allow different communities or collections to store bitstreams in different buckets.
  - Example:
    - Community 1 may want to store bitstreams in bucket "bucket_1"
    - Community 2 may want to store bitstreams in bucket "bucket_2"
    - etc.
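The suggested improvement for store_number (explicit values instead of 77) could look like the sketch below. It only illustrates the proposal from the list above and is not existing CLARIN-DSpace behavior; `Store`, `SYNCED_MARKER`, and `describe` are hypothetical names.

```python
from enum import IntEnum


class Store(IntEnum):
    """Explicit storage locations, as proposed above."""
    LOCAL = 0
    S3 = 1
    OTHER = 2


# Marker currently written when sync.storage.service.enabled=true.
SYNCED_MARKER = 77


def describe(store_number: int) -> str:
    """Translate a store_number into a readable storage location."""
    if store_number == SYNCED_MARKER:
        # 77 says the bitstream was mirrored, not where it should be read from.
        return "synced (location ambiguous)"
    try:
        return Store(store_number).name.lower()
    except ValueError:
        return "unknown"
```

With explicit values, a lookup such as `describe(1)` identifies the store directly, while 77 can only be reported as ambiguous.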