
S3 Storage Integration

Paurikova2 edited this page Sep 2, 2025 · 16 revisions

Overview

CLARIN-DSpace supports storing bitstreams in S3-compatible object storage.
When enabled, bitstreams can be uploaded directly into S3 and downloaded by clients either through presigned URLs or through the standard download path.

  • This allows efficient and direct file transfer from S3 to the client.
  • Download statistics are preserved (requests still pass through CLARIN-DSpace before the presigned URL is issued).
  • Presigned URLs can be configured with a short lifetime, since downloads are expected to start immediately.
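
The flow described above (record the request first, then hand the client a short-lived URL) can be sketched with a toy signing scheme. This is not AWS Signature Version 4 and not the CLARIN-DSpace implementation: `SECRET`, `sign_url`, `is_valid`, `handle_download`, the example host and the 60-second lifetime are all hypothetical names and values chosen for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key, not a real S3 credential


def _signature(path: str, expires: int) -> str:
    # Toy HMAC over the object path and expiry time (NOT AWS SigV4).
    return hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(object_url: str, lifetime_s: int, now: float) -> str:
    """Append an expiry timestamp and a signature to object_url."""
    expires = int(now) + lifetime_s
    sig = _signature(urlparse(object_url).path, expires)
    return f"{object_url}?{urlencode({'expires': expires, 'sig': sig})}"


def is_valid(url: str, now: float) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    expires = int(query["expires"][0])
    expected = _signature(parsed.path, expires)
    return hmac.compare_digest(expected, query["sig"][0]) and now <= expires


def handle_download(key: str, stats: dict, now: float) -> str:
    """Count the download in the repository first, then issue the short-lived URL."""
    stats[key] = stats.get(key, 0) + 1
    return sign_url(f"https://s3.example.org/bucket/{key}", lifetime_s=60, now=now)
```

Because the counter is updated before the URL is returned, the statistic is recorded even though the file bytes never pass through the repository. A short lifetime is enough because the client is expected to follow the URL immediately.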

Configuration

S3 is configured through environment variables in the .env file or through properties in the DSpace configuration files.

Example of additional .env configuration:

S3_STORAGE=1
S3_ENABLED=true

S3_RELATIVE_PATH=true
S3_BUCKET=your-bucket-name
S3_SUBFOLDER=assetstore
S3_ACCESS=XXXX
S3_SECRET=XXXXXX
S3_REGION_NAME=
S3_PATH_STYLE_ACCESS=false
S3_ENDPOINT=your-endpoint
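
A small pre-flight check can catch an incomplete .env before the repository starts. The sketch below parses KEY=VALUE lines and reports empty S3 settings. Which keys are mandatory is an assumption made here (`REQUIRED_WHEN_ENABLED`), and `parse_env` and `missing_s3_settings` are hypothetical helpers, not part of CLARIN-DSpace.

```python
# Assumption: these keys must be non-empty whenever S3_ENABLED=true.
REQUIRED_WHEN_ENABLED = ("S3_BUCKET", "S3_ACCESS", "S3_SECRET", "S3_ENDPOINT")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def missing_s3_settings(settings: dict[str, str]) -> list[str]:
    """Return the assumed-required keys that are absent or empty."""
    if settings.get("S3_ENABLED", "false").lower() != "true":
        return []  # S3 is switched off, so nothing is required
    return [key for key in REQUIRED_WHEN_ENABLED if not settings.get(key)]
```

For example, `missing_s3_settings(parse_env(open(".env").read()))` returns an empty list when every assumed-required key has a value.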

Additional properties (local.cfg / dspace.cfg):

Enable direct S3 download via presigned URLs: s3.download.direct.enabled = true

Enable/disable keeping a local copy alongside S3: sync.storage.service.enabled = true

  • true → new uploads are stored in both the local assetstore and the S3 bucket.
    Bitstreams get store_number=77 in the database. Downloads still come from the primary store (assetstore.index.primary).
    ⚠️ If S3 is the primary store and it is down or the file is missing, the local copy is not used automatically.

  • false → new uploads are stored only in the primary store.
    If the primary is S3, nothing is written locally. If the primary is local, S3 is not used at all.

Examples:

  1. New repo, sync=true → files stored in /assetstore/ and S3.
  2. New repo, sync=false with S3 primary → files go only to S3.
  3. Existing repo enabling S3 later, sync=false → old files remain local, new ones go only to S3.
  4. Existing repo enabling S3 later, sync=true → old files remain local, new ones go to both local and S3.
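
The upload rules above can be summarized as a small decision function. This is a model of the documented behavior, not the actual storage code. Two things are assumptions: the store indexes (`LOCAL = 0`, `S3 = 1`; the page only states that `assetstore.index.primary=1` selects S3) and that, without sync, the recorded store_number equals the primary index. `plan_upload` is a hypothetical name.

```python
LOCAL, S3 = 0, 1          # assumed store indexes
SYNC_STORE_NUMBER = 77    # store_number recorded when sync is enabled


def plan_upload(primary: int, sync_enabled: bool) -> tuple[set[str], int]:
    """Return (stores a new upload is written to, store_number saved in the db)."""
    if sync_enabled:
        # Mirrored upload: both stores receive the file.
        return {"local", "s3"}, SYNC_STORE_NUMBER
    # No sync: only the primary store receives the file.
    # Assumption: the store_number then equals the primary index.
    return {"s3" if primary == S3 else "local"}, primary
```

`plan_upload(S3, False)` corresponds to example 2 above (files go only to S3), and `plan_upload(S3, True)` to examples 1 and 4.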

Enable/disable multipart uploads to S3 (formerly false): s3.upload.by.parts.enabled = false

Storage & Synchronization Modes

The storage behavior depends on four factors:

  1. bitstream store_number from db
  2. assetstore.s3.enabled (on/off switch)
  3. assetstore.index.primary (which storage is primary for downloads)
  4. sync.storage.service.enabled (whether bitstreams are mirrored in both local + S3)

Upload

  • If sync.storage.service.enabled = true, files are stored in both S3 and local storage.
  • If sync.storage.service.enabled = false, files are stored only in the primary store (assetstore.index.primary).

Download

  • Downloads always happen from the primary storage (assetstore.index.primary).
  • If s3.download.direct.enabled = true, files are downloaded directly from S3 using a presigned URL.
  • ⚠️ If a file is missing in S3 but present locally, there is no automatic fallback to local storage.
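
The missing fallback is easy to overlook, so here is a minimal model of it. The stores are plain dictionaries standing in for the real backends, and `download` and `MissingBitstream` are hypothetical names used only for this sketch.

```python
class MissingBitstream(Exception):
    """Raised when the primary store does not hold the requested key."""


def download(key: str, primary: str, stores: dict[str, dict[str, bytes]]) -> bytes:
    """Read a bitstream from the primary store only; other stores are never consulted."""
    store = stores[primary]
    if key not in store:
        # No fallback: a copy in another store does not help.
        raise MissingBitstream(f"{key!r} not found in primary store {primary!r}")
    return store[key]
```

With `stores = {"s3": {}, "local": {"a": b"data"}}`, calling `download("a", "s3", stores)` raises `MissingBitstream` even though the local store holds the file.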

Use Cases

1. New Repository (S3 from the start)

  • 1.1 Sync enabled → Files stored in both S3 and local assetstore.
  • 1.2 Sync disabled → Files stored only in S3.

2. Existing Repository (adding S3 later)

  • 2.1 Replace local storage

    • Copy existing assetstore to S3.
    • Update configuration to set S3 as primary (assetstore.index.primary=1).
    • Ensure bitstream store_number values in db match the new storage setup.
  • 2.2 Extend with S3

    • 2.2.1 No sync → Old files remain local, new files go to S3.
    • 2.2.2 With sync → New uploads are stored in both local + S3.
      (But downloads still always come from primary, i.e., S3).

⚠️ Note: Bitstreams uploaded before S3 was enabled (with store_number=77 in db) may become inaccessible once S3 is set as primary. The system assumes they are in S3.
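
Before switching the primary store, it can help to see how many bitstreams carry each store_number. The sketch below assumes the values have already been read from the database (for instance from the bitstream table's store_number column) and flags every value that differs from the intended primary index. `summarize_store_numbers` and `needs_review` are hypothetical helpers.

```python
from collections import Counter


def summarize_store_numbers(store_numbers: list[int]) -> dict[int, int]:
    """Count bitstreams per store_number, ordered by store_number."""
    return dict(sorted(Counter(store_numbers).items()))


def needs_review(counts: dict[int, int], primary: int) -> list[int]:
    """Return the store_number values that do not match the intended primary index."""
    return [number for number in counts if number != primary]
```

A result such as `[0, 77]` for `primary=1` means some bitstreams are still recorded as local or as synced, and their location should be checked before S3 becomes primary.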


Known Issues & Limitations

  • No fallback download

    • If S3 is down or a file is missing, downloads fail even if the file exists locally.
  • Ambiguous store_number=77

    • Used when sync.storage.service.enabled=true.
    • Does not clearly identify the storage location (like local, S3 or other).
    • Suggested improvement: use explicit values (0=local, 1=S3, 2=other).
  • Old bitstreams

    • Files uploaded before enabling S3 (with store_number=77 in db) may break when S3 is set as primary.
  • Limited bucket configuration

    • The S3 storage integration does not allow different communities or collections to store bitstreams in different buckets.
    • Example:
      • Community 1 may want to store bitstreams in bucket "bucket_1"
      • Community 2 may want to store bitstreams in bucket "bucket_2"
      • etc.
