
UFAL/S3 - The problem to download a big file - it does not show the progress#966

Open
milanmajchrak wants to merge 2 commits into dtq-dev from ufal/s3-download-too-long

Conversation

@milanmajchrak
Collaborator

@milanmajchrak milanmajchrak commented May 14, 2025

The code is cherry-picked from the upstream.

Phases       MP   MM   MB   MR   JM   Total
ETA           0    0    0    0    0       0
Developing    0    0    0    0    0       0
Review        0    0    0    0    0       0
Total         -    -    -    -    -       0
ETA est.                                  0
ETA cust.     -    -    -    -    -       0

Problem description

See the linked issue (https://github.com/dataquest-dev/DSpace/issues/965) for the full problem description.

Summary by CodeRabbit

  • New Features

    • Improved file download performance from S3 by introducing a streaming mechanism that downloads files in configurable chunks, reducing memory usage and wait times for large files.
  • Bug Fixes

    • Enhanced error handling during file downloads to ensure temporary files are cleaned up in case of failures.
  • Tests

    • Expanded test coverage for different file sizes and chunking scenarios.
    • Added tests to confirm service behavior when S3 storage is disabled via configuration.

Porting of "Improve S3 Bitstream Storage to Lazy download object from S3" to 7_x
@coderabbitai

coderabbitai bot commented May 14, 2025

Walkthrough

The S3 bitstore service was refactored to implement a lazy-loading mechanism for retrieving bitstreams from S3, downloading data in configurable chunks instead of all at once. Tests were updated and expanded to verify chunked streaming, configuration-driven enablement, and correct service initialization based on configuration properties.

Changes

  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java: Refactored get(Bitstream) to use a new inner class S3LazyInputStream for chunked, lazy S3 streaming; added buffer size configuration and improved error handling; updated init() to skip if disabled.
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java: Enhanced integration tests for chunked streaming with multiple bitstreams; added configuration-driven enablement/disablement setup; introduced helper method checkGetPut(); added test to verify service does not initialize when disabled.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant S3BitStoreService
    participant S3LazyInputStream
    participant S3

    Client->>S3BitStoreService: get(Bitstream)
    S3BitStoreService->>Client: returns S3LazyInputStream

    loop While reading InputStream
        Client->>S3LazyInputStream: read()
        alt Chunk not downloaded
            S3LazyInputStream->>S3: GET object (range: next chunk)
            S3-->>S3LazyInputStream: Chunk data
            S3LazyInputStream->>TempFile: Store chunk
        end
        S3LazyInputStream->>Client: bytes from chunk
    end

    Client->>S3LazyInputStream: close()
    S3LazyInputStream->>TempFile: Delete temp files
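The flow in the diagram above can be sketched as a self-contained Java analogue. Here a hypothetical LazyChunkInputStream reads an in-memory byte[] through simulated ranged fetches of at most chunkMaxSize bytes, mirroring how S3LazyInputStream issues ranged GetObjectRequests; all names are illustrative, and only the chunk-boundary logic in read() follows the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LazyChunkStreamSketch {

    static class LazyChunkInputStream extends InputStream {
        private final byte[] remoteObject;   // stand-in for the S3 object
        private final long chunkMaxSize;
        private long currPos = 0;            // absolute position in the object
        private long endOfChunk = 0;         // absolute end of the data fetched so far
        private InputStream currentChunkStream;

        LazyChunkInputStream(byte[] remoteObject, long chunkMaxSize) {
            this.remoteObject = remoteObject;
            this.chunkMaxSize = chunkMaxSize;
        }

        // Simulates a ranged GET: returns bytes [start, end] inclusive.
        private byte[] fetchRange(long start, long end) {
            return Arrays.copyOfRange(remoteObject, (int) start, (int) end + 1);
        }

        private void downloadChunk() {
            long startByte = currPos;
            long endByte = Math.min(startByte + chunkMaxSize - 1, remoteObject.length - 1);
            byte[] chunk = fetchRange(startByte, endByte);
            currentChunkStream = new ByteArrayInputStream(chunk);
            endOfChunk += chunk.length;
        }

        @Override
        public int read() throws IOException {
            // Current chunk exhausted but more bytes remain: fetch the next range.
            if (currPos == endOfChunk && currPos < remoteObject.length) {
                if (currentChunkStream != null) {
                    currentChunkStream.close();
                }
                downloadChunk();
            }
            if (currPos >= endOfChunk) {
                return -1; // end of object
            }
            int b = currentChunkStream.read();
            currPos++;
            return b;
        }

        @Override
        public void close() throws IOException {
            if (currentChunkStream != null) {
                currentChunkStream.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789ABCDEF".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new LazyChunkInputStream(data, 5)) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
        // 16 bytes read through 4 simulated ranged fetches of at most 5 bytes each
        System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Reading never pulls more than one chunk into memory at a time, which is the whole point of the refactor for large files.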

Poem

In the cloud where bitstreams soar,
Now chunks arrive, not all before.
Streams flow lazy, bytes on demand,
Rabbits cheer with wand in hand.
Tests now check each streaming part—
S3’s dance, a chunked-up art!
🐇✨



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 023f025 and 35d8473.

📒 Files selected for processing (2)
  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (5 hunks)
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
  • GitHub Check: Run Integration Tests
  • GitHub Check: Run Unit Tests


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (3)

665-667: Add validation for buffer size

The setter method should validate the buffer size to prevent setting invalid values (zero or negative).

public void setBufferSize(long bufferSize) {
+    if (bufferSize <= 0) {
+        throw new IllegalArgumentException("Buffer size must be positive");
+    }
    this.bufferSize = bufferSize;
}

677-744: Well-implemented lazy loading stream with proper resource management

The S3LazyInputStream implementation correctly handles:

  • Chunked downloads using range requests
  • Temporary file management
  • Stream lifecycle
  • Error handling

A few suggestions for improvement:

  1. Consider adding a buffer size parameter check in the constructor
  2. Consider implementing read(byte[], int, int) for better performance
  3. It might be beneficial to add some logging for debugging purposes

Here's how you might implement the optimized read method:

@Override
public int read(byte[] b, int off, int len) throws IOException {
    // has the current chunk been fully read while more data remains?
    if (currPos == endOfChunk && currPos < fileSize) {
        currentChunkStream.close();
        downloadChunk();
    }

    if (currPos >= endOfChunk) {
        return -1;
    }

    int bytesToRead = (int) Math.min(len, endOfChunk - currPos);
    int bytesRead = currentChunkStream.read(b, off, bytesToRead);
    
    if (bytesRead > 0) {
        currPos += bytesRead;
    } else if (bytesRead == -1) {
        currentChunkStream.close();
    }
    
    return bytesRead;
}

717-735: Ensure proper cleanup of temporary files on errors

While the code does clean up the temporary file on exception, consider a finally block so the partial file is deleted on any failure path, not only the caught exception types.

private void downloadChunk() throws IOException {
    // Create a GetObjectRequest with the desired byte range
    long startByte = currPos; // Start byte (inclusive)
    long endByte = Long.min(startByte + chunkMaxSize - 1, fileSize - 1); // End byte (inclusive)
    GetObjectRequest getRequest = new GetObjectRequest(bucketName, objectKey)
            .withRange(startByte, endByte);

    File currentChunkFile = File.createTempFile("s3-disk-copy-" + UUID.randomUUID(), "temp");
    currentChunkFile.deleteOnExit();
    boolean success = false;
    try {
        Download download = tm.download(getRequest, currentChunkFile);
        download.waitForCompletion();
        currentChunkStream = new DeleteOnCloseFileInputStream(currentChunkFile);
        endOfChunk = endOfChunk + download.getProgress().getBytesTransferred();
        success = true;
    } catch (AmazonClientException | InterruptedException e) {
        throw new IOException(e);
    } finally {
        if (!success) {
            currentChunkFile.delete();
        }
    }
}
dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (1)

1-467: Consider adding error tests for chunked downloads

The test coverage for happy path is excellent, but consider adding tests for error scenarios like:

  • Network interruption during chunk download
  • S3 returning errors for specific chunks
  • Testing with very large files or invalid ranges

This would ensure the error handling in the lazy loading implementation is robust.
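A minimal, self-contained sketch of the cleanup contract such an error test would assert. The failing download below is simulated with a plain method, not the real TransferManager or the DSpace test API; only the delete-on-failure behavior mirrors the PR.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class ChunkFailureCleanupSketch {

    // Stand-in for a chunk download that dies mid-transfer,
    // leaving partial data on disk before throwing.
    static void failingDownload(File target) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(target)) {
            fos.write(new byte[]{1, 2, 3}); // partial data before the failure
        }
        throw new IOException("simulated network interruption");
    }

    public static void main(String[] args) throws IOException {
        File chunkFile = File.createTempFile("s3-chunk-sketch-", ".tmp");
        chunkFile.deleteOnExit();
        try {
            failingDownload(chunkFile);
        } catch (IOException expected) {
            // The contract a real test should assert: the failure is surfaced
            // as IOException and no partial temp file survives it.
            chunkFile.delete();
            System.out.println("failure surfaced: " + expected.getMessage());
            System.out.println("temp file cleaned up: " + !chunkFile.exists());
        }
    }
}
```

A real integration test would capture the File reference created by downloadChunk() (or count files in the temp directory) and assert it no longer exists after the induced failure.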

🧰 Tools
🪛 ast-grep (0.31.1)

[warning] 420-420: Detected MD5 hash algorithm which is considered insecure. MD5 is not collision resistant and is therefore not suitable as a cryptographic signature. Use HMAC instead.
Context: "MD5"
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(use-of-md5-java)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2031acf and 023f025.

📒 Files selected for processing (2)
  • dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (4 hunks)
  • dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
  • GitHub Check: Run Unit Tests
  • GitHub Check: Run Integration Tests
🔇 Additional comments (9)
dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (3)

110-113: Good configuration approach for chunk size

The default 5MB buffer size for S3 chunks is a reasonable choice, aligning with AWS's recommended chunk size for multipart operations. The clear documentation explains the purpose of this configuration property.


298-298: Excellent refactoring to implement lazy-loading

The change from eager full download to lazy-loading with chunked streaming is a significant improvement for handling large files efficiently.


729-729: ✅ Verification successful

Need to check for DeleteOnCloseFileInputStream availability

This code references DeleteOnCloseFileInputStream class which isn't defined in the file. Ensure this class is available or include its implementation.


🏁 Script executed:

#!/bin/bash
# Check if DeleteOnCloseFileInputStream is defined somewhere in the codebase
rg "class DeleteOnCloseFileInputStream"

Length of output: 196


DeleteOnCloseFileInputStream class verified

DeleteOnCloseFileInputStream is defined in the codebase at

  • dspace-api/src/main/java/org/dspace/storage/bitstore/DeleteOnCloseFileInputStream.java

No further action required.

dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (6)

85-86: Good approach using ConfigurationService for tests

Using the actual ConfigurationService instead of mocking improves test reliability by verifying the real integration behavior.


100-102: Nice test setup for buffer size

Setting a small buffer size (22 bytes) for tests is a great approach to verify chunked download with manageable test data sizes.


133-140: Comprehensive test coverage for chunking scenarios

The test data includes excellent coverage cases:

  • Content smaller than a chunk
  • Content spanning across chunk boundary
  • Content fitting exactly two chunks
  • Content spanning more than two chunks

This ensures the chunking logic handles all edge cases correctly.


143-146: Good test execution for multiple content sizes

Testing with various content sizes ensures the chunking mechanism works properly across different scenarios.


150-164: Excellent refactoring with checkGetPut helper method

The extracted helper method improves code readability and maintainability while ensuring consistent verification across all test cases.


408-417: Great test for configuration-driven initialization

This test properly verifies that the service respects the assetstore.s3.enabled configuration property and doesn't initialize when disabled.
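For reference, a hedged sketch of the configuration involved: assetstore.s3.enabled appears in this PR's tests, but the buffer-size property name below is an assumption inferred from the setBufferSize() setter and is not confirmed by the diff.

```properties
# Enable/disable the S3 assetstore (property name verified by the new IT test)
assetstore.s3.enabled = true

# Chunk size for lazy downloads, 5 MB default per the PR.
# Property name is illustrative only; the diff shows setBufferSize() but not its key.
# assetstore.s3.bufferSize = 5242880
```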

@milanmajchrak milanmajchrak changed the title https://github.com/dataquest-dev/DSpace/issues/965 → UFAL/S3 - The problem to download a big file - it does not show the progress May 14, 2025
@milanmajchrak milanmajchrak changed the title UFAL/S3 - The problem to download a big file - it does not show the progress → UFAL/S3 - The problem to download a big file - it does not show the progress May 15, 2025
@milanmajchrak milanmajchrak requested a review from vidiecan May 15, 2025 07:35
@Override
public int read() throws IOException {
// is the current chunk completely read and other are available?
if (currPos == endOfChunk && currPos < fileSize) {


this condition is (at least) superfluous
currPos == endOfChunk && currPos < fileSize


Development

Successfully merging this pull request may close these issues.

UFAL/S3 - The problem to download a big file - it does not show the progress
