
UFAL/S3 - The problem to download a big file - it does not show the progress#966

Open
milanmajchrak wants to merge 2 commits into dtq-dev from ufal/s3-download-too-long

Conversation

@milanmajchrak
Collaborator

@milanmajchrak milanmajchrak commented May 14, 2025

The code is cherry-picked from the upstream.

Phases       MP   MM   MB   MR   JM   Total
ETA           0    0    0    0    0       0
Developing    0    0    0    0    0       0
Review        0    0    0    0    0       0
Total         -    -    -    -    -       0
ETA est.                                  0
ETA cust.     -    -    -    -    -       0

Problem description

See the linked issue (https://github.com/dataquest-dev/DSpace/issues/965) for the full problem description.

Summary by CodeRabbit

  • New Features

    • Improved file download performance from S3 by introducing a streaming mechanism that downloads files in configurable chunks, reducing memory usage and wait times for large files.
  • Bug Fixes

    • Enhanced error handling during file downloads to ensure temporary files are cleaned up in case of failures.
  • Tests

    • Expanded test coverage for different file sizes and chunking scenarios.
    • Added tests to confirm service behavior when S3 storage is disabled via configuration.

Porting of "Improve S3 Bitstream Storage to Lazy download object from S3" to 7_x
@coderabbitai

coderabbitai bot commented May 14, 2025

Walkthrough

The S3 bitstore service was refactored to implement a lazy-loading mechanism for retrieving bitstreams from S3, downloading data in configurable chunks instead of all at once. Tests were updated and expanded to verify chunked streaming, configuration-driven enablement, and correct service initialization based on configuration properties.

Changes

  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java: Refactored get(Bitstream) to use a new inner class S3LazyInputStream for chunked, lazy S3 streaming; added buffer size configuration and improved error handling; updated init() to skip if disabled.
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java: Enhanced integration tests for chunked streaming with multiple bitstreams; added configuration-driven enablement/disablement setup; introduced helper method checkGetPut(); added test to verify service does not initialize when disabled.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant S3BitStoreService
    participant S3LazyInputStream
    participant S3

    Client->>S3BitStoreService: get(Bitstream)
    S3BitStoreService->>Client: returns S3LazyInputStream

    loop While reading InputStream
        Client->>S3LazyInputStream: read()
        alt Chunk not downloaded
            S3LazyInputStream->>S3: GET object (range: next chunk)
            S3-->>S3LazyInputStream: Chunk data
            S3LazyInputStream->>TempFile: Store chunk
        end
        S3LazyInputStream->>Client: bytes from chunk
    end

    Client->>S3LazyInputStream: close()
    S3LazyInputStream->>TempFile: Delete temp files
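The flow in the diagram above can be sketched as a self-contained Java analogue. Here a hypothetical LazyChunkInputStream reads an in-memory byte[] through simulated ranged fetches of at most chunkMaxSize bytes, mirroring how S3LazyInputStream issues ranged GetObjectRequests; all names are illustrative, and only the chunk-boundary logic in read() follows the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LazyChunkStreamSketch {

    static class LazyChunkInputStream extends InputStream {
        private final byte[] remoteObject;   // stand-in for the S3 object
        private final long chunkMaxSize;
        private long currPos = 0;            // absolute position in the object
        private long endOfChunk = 0;         // absolute end of the data fetched so far
        private InputStream currentChunkStream;

        LazyChunkInputStream(byte[] remoteObject, long chunkMaxSize) {
            this.remoteObject = remoteObject;
            this.chunkMaxSize = chunkMaxSize;
        }

        // Simulates a ranged GET: returns bytes [start, end] inclusive.
        private byte[] fetchRange(long start, long end) {
            return Arrays.copyOfRange(remoteObject, (int) start, (int) end + 1);
        }

        private void downloadChunk() {
            long startByte = currPos;
            long endByte = Math.min(startByte + chunkMaxSize - 1, remoteObject.length - 1);
            byte[] chunk = fetchRange(startByte, endByte);
            currentChunkStream = new ByteArrayInputStream(chunk);
            endOfChunk += chunk.length;
        }

        @Override
        public int read() throws IOException {
            // Current chunk exhausted but more bytes remain: fetch the next range.
            if (currPos == endOfChunk && currPos < remoteObject.length) {
                if (currentChunkStream != null) {
                    currentChunkStream.close();
                }
                downloadChunk();
            }
            if (currPos >= endOfChunk) {
                return -1; // end of object
            }
            int b = currentChunkStream.read();
            currPos++;
            return b;
        }

        @Override
        public void close() throws IOException {
            if (currentChunkStream != null) {
                currentChunkStream.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789ABCDEF".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new LazyChunkInputStream(data, 5)) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
        // 16 bytes read through 4 simulated ranged fetches of at most 5 bytes each
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Reading never pulls more than one chunk into memory at a time, which is the whole point of the refactor for large files.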

Poem

In the cloud where bitstreams soar,
Now chunks arrive, not all before.
Streams flow lazy, bytes on demand,
Rabbits cheer with wand in hand.
Tests now check each streaming part—
S3’s dance, a chunked-up art!
🐇✨



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 023f025 and 35d8473.

📒 Files selected for processing (2)
  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (5 hunks)
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
  • GitHub Check: Run Integration Tests
  • GitHub Check: Run Unit Tests


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (3)

665-667: Add validation for buffer size

The setter method should validate the buffer size to prevent setting invalid values (zero or negative).

public void setBufferSize(long bufferSize) {
+    if (bufferSize <= 0) {
+        throw new IllegalArgumentException("Buffer size must be positive");
+    }
    this.bufferSize = bufferSize;
}

677-744: Well-implemented lazy loading stream with proper resource management

The S3LazyInputStream implementation correctly handles:

  • Chunked downloads using range requests
  • Temporary file management
  • Stream lifecycle
  • Error handling

A few suggestions for improvement:

  1. Consider adding a buffer size parameter check in the constructor
  2. Consider implementing read(byte[], int, int) for better performance
  3. It might be beneficial to add some logging for debugging purposes

Here's how you might implement the optimized read method:

@Override
public int read(byte[] b, int off, int len) throws IOException {
    // has the current chunk been fully read while more data remains?
    if (currPos == endOfChunk && currPos < fileSize) {
        currentChunkStream.close();
        downloadChunk();
    }

    if (currPos >= endOfChunk) {
        return -1;
    }

    int bytesToRead = (int) Math.min(len, endOfChunk - currPos);
    int bytesRead = currentChunkStream.read(b, off, bytesToRead);
    
    if (bytesRead > 0) {
        currPos += bytesRead;
    } else if (bytesRead == -1) {
        currentChunkStream.close();
    }
    
    return bytesRead;
}

717-735: Ensure proper cleanup of temporary files on errors

While the code does clean up the temporary file on exception, consider a finally block so the partial file is deleted on any failure path, not only the caught exception types.

private void downloadChunk() throws IOException {
    // Create a GetObjectRequest with the desired byte range
    long startByte = currPos; // Start byte (inclusive)
    long endByte = Long.min(startByte + chunkMaxSize - 1, fileSize - 1); // End byte (inclusive)
    GetObjectRequest getRequest = new GetObjectRequest(bucketName, objectKey)
            .withRange(startByte, endByte);

    File currentChunkFile = File.createTempFile("s3-disk-copy-" + UUID.randomUUID(), "temp");
    currentChunkFile.deleteOnExit();
    boolean success = false;
    try {
        Download download = tm.download(getRequest, currentChunkFile);
        download.waitForCompletion();
        currentChunkStream = new DeleteOnCloseFileInputStream(currentChunkFile);
        endOfChunk = endOfChunk + download.getProgress().getBytesTransferred();
        success = true;
    } catch (AmazonClientException | InterruptedException e) {
        throw new IOException(e);
    } finally {
        if (!success) {
            currentChunkFile.delete();
        }
    }
}
dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (1)

1-467: Consider adding error tests for chunked downloads

The test coverage for happy path is excellent, but consider adding tests for error scenarios like:

  • Network interruption during chunk download
  • S3 returning errors for specific chunks
  • Testing with very large files or invalid ranges

This would ensure the error handling in the lazy loading implementation is robust.
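A minimal, self-contained sketch of the cleanup contract such an error test would assert. The failing download below is simulated with a plain method, not the real TransferManager or the DSpace test API; only the delete-on-failure behavior mirrors the PR.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class ChunkFailureCleanupSketch {

    // Stand-in for a chunk download that dies mid-transfer,
    // leaving partial data on disk before throwing.
    static void failingDownload(File target) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(target)) {
            fos.write(new byte[]{1, 2, 3}); // partial data before the failure
        }
        throw new IOException("simulated network interruption");
    }

    public static void main(String[] args) throws IOException {
        File chunkFile = File.createTempFile("s3-chunk-sketch-", ".tmp");
        chunkFile.deleteOnExit();
        try {
            failingDownload(chunkFile);
        } catch (IOException expected) {
            // The contract a real test should assert: the failure is surfaced
            // as IOException and no partial temp file survives it.
            chunkFile.delete();
            System.out.println("failure surfaced: " + expected.getMessage());
            System.out.println("temp file cleaned up: " + !chunkFile.exists());
        }
    }
}
```

A real integration test would capture the File reference created by downloadChunk() (or count files in the temp directory) and assert it no longer exists after the induced failure.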

🧰 Tools
🪛 ast-grep (0.31.1)

[warning] 420-420: Detected MD5 hash algorithm which is considered insecure. MD5 is not collision resistant and is therefore not suitable as a cryptographic signature. Use HMAC instead.
Context: "MD5"
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(use-of-md5-java)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2031acf and 023f025.

📒 Files selected for processing (2)
  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (4 hunks)
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
  • GitHub Check: Run Unit Tests
  • GitHub Check: Run Integration Tests
🔇 Additional comments (9)
dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (3)

110-113: Good configuration approach for chunk size

The default 5MB buffer size for S3 chunks is a reasonable choice, aligning with AWS's recommended chunk size for multipart operations. The clear documentation explains the purpose of this configuration property.


298-298: Excellent refactoring to implement lazy-loading

The change from eager full download to lazy-loading with chunked streaming is a significant improvement for handling large files efficiently.


729-729: ✅ Verification successful

Need to check for DeleteOnCloseFileInputStream availability

This code references DeleteOnCloseFileInputStream class which isn't defined in the file. Ensure this class is available or include its implementation.


🏁 Script executed:

#!/bin/bash
# Check if DeleteOnCloseFileInputStream is defined somewhere in the codebase
rg "class DeleteOnCloseFileInputStream"

Length of output: 196


DeleteOnCloseFileInputStream class verified

DeleteOnCloseFileInputStream is defined in the codebase at

  • dspace-api/src/main/java/org/dspace/storage/bitstore/DeleteOnCloseFileInputStream.java

No further action required.

dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (6)

85-86: Good approach using ConfigurationService for tests

Using the actual ConfigurationService instead of mocking improves test reliability by verifying the real integration behavior.


100-102: Nice test setup for buffer size

Setting a small buffer size (22 bytes) for tests is a great approach to verify chunked download with manageable test data sizes.


133-140: Comprehensive test coverage for chunking scenarios

The test data includes excellent coverage cases:

  • Content smaller than a chunk
  • Content spanning across chunk boundary
  • Content fitting exactly two chunks
  • Content spanning more than two chunks

This ensures the chunking logic handles all edge cases correctly.


143-146: Good test execution for multiple content sizes

Testing with various content sizes ensures the chunking mechanism works properly across different scenarios.


150-164: Excellent refactoring with checkGetPut helper method

The extracted helper method improves code readability and maintainability while ensuring consistent verification across all test cases.


408-417: Great test for configuration-driven initialization

This test properly verifies that the service respects the assetstore.s3.enabled configuration property and doesn't initialize when disabled.
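For reference, a hedged sketch of the configuration involved: assetstore.s3.enabled appears in this PR's tests, but the buffer-size property name below is an assumption inferred from the setBufferSize() setter and is not confirmed by the diff.

```properties
# Enable/disable the S3 assetstore (property name verified by the new IT test)
assetstore.s3.enabled = true

# Chunk size for lazy downloads, 5 MB default per the PR.
# Property name is illustrative only; the diff shows setBufferSize() but not its key.
# assetstore.s3.bufferSize = 5242880
```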

@milanmajchrak milanmajchrak changed the title https://github.com/dataquest-dev/DSpace/issues/965 → UFAL/S3 - The problem to download a big file - it does not show the progress May 14, 2025
@milanmajchrak milanmajchrak changed the title UFAL/S3 - The problem to download a big file - it does not show the progress → UFAL/S3 - The problem to download a big file - it does not show the progress May 15, 2025
@milanmajchrak milanmajchrak requested a review from vidiecan May 15, 2025 07:35
@Override
public int read() throws IOException {
// is the current chunk completely read and other are available?
if (currPos == endOfChunk && currPos < fileSize) {


this condition is (at least) superfluous
currPos == endOfChunk && currPos < fileSize


Development

Successfully merging this pull request may close these issues.

UFAL/S3 - The problem to download a big file - it does not show the progress
