UFAL/S3 - The problem to download a big file - it does not show the progress#966
milanmajchrak wants to merge 2 commits into dtq-dev
Conversation
Porting of Improve S3 Bitstream Storage to Lazy download object from S3 to 7_x
Walkthrough

The S3 bitstore service was refactored to implement a lazy-loading mechanism for retrieving bitstreams from S3, downloading data in configurable chunks instead of all at once. Tests were updated and expanded to verify chunked streaming, configuration-driven enablement, and correct service initialization based on configuration properties.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant S3BitStoreService
    participant S3LazyInputStream
    participant S3
    Client->>S3BitStoreService: get(Bitstream)
    S3BitStoreService->>Client: returns S3LazyInputStream
    loop While reading InputStream
        Client->>S3LazyInputStream: read()
        alt Chunk not downloaded
            S3LazyInputStream->>S3: GET object (range: next chunk)
            S3-->>S3LazyInputStream: Chunk data
            S3LazyInputStream->>TempFile: Store chunk
        end
        S3LazyInputStream->>Client: bytes from chunk
    end
    Client->>S3LazyInputStream: close()
    S3LazyInputStream->>TempFile: Delete temp files
```
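The flow above can be sketched in plain Java. The following is a minimal, self-contained analogue (not the PR's actual code): a byte array stands in for the remote S3 object, and `downloadChunk()` plays the role of a ranged GET request. All class and field names here are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LazyChunkDemo {

    // Hypothetical stand-in for an S3-backed lazy stream.
    static class LazyRangeInputStream extends InputStream {
        private final byte[] remote;      // stands in for the S3 object
        private final int chunkSize;
        private long currPos = 0;         // absolute position in the object
        private long endOfChunk = 0;      // absolute end of the downloaded chunk
        private InputStream currentChunkStream = InputStream.nullInputStream();

        LazyRangeInputStream(byte[] remote, int chunkSize) {
            this.remote = remote;
            this.chunkSize = chunkSize;
        }

        // Simulates GetObjectRequest.withRange(start, end): fetches one chunk.
        private void downloadChunk() {
            long start = currPos;
            long end = Math.min(start + chunkSize, remote.length);
            currentChunkStream = new ByteArrayInputStream(
                    Arrays.copyOfRange(remote, (int) start, (int) end));
            endOfChunk = end;
        }

        @Override
        public int read() throws IOException {
            // current chunk fully consumed and more data remains -> fetch next
            if (currPos == endOfChunk && currPos < remote.length) {
                currentChunkStream.close();
                downloadChunk();
            }
            if (currPos >= remote.length) {
                return -1;
            }
            int b = currentChunkStream.read();
            if (b != -1) {
                currPos++;
            }
            return b;
        }
    }

    // Reads the whole "object" through the lazy stream and returns it as text.
    public static String readAll(byte[] data, int chunkSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new LazyRangeInputStream(data, chunkSize)) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String content = "hello, chunked world";  // 20 bytes, chunk size 7 -> 3 chunks
        String roundTrip = readAll(content.getBytes(StandardCharsets.UTF_8), 7);
        System.out.println(roundTrip.equals(content) ? "OK" : "MISMATCH");  // prints "OK"
    }
}
```

The key invariant is the same one the PR relies on: a new range request is issued only when `currPos` reaches `endOfChunk` while data remains, so the caller sees one continuous stream.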
Actionable comments posted: 0
🧹 Nitpick comments (4)
dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (3)
665-667: Add validation for buffer size

The setter method should validate the buffer size to prevent setting invalid values (zero or negative).

```diff
 public void setBufferSize(long bufferSize) {
+    if (bufferSize <= 0) {
+        throw new IllegalArgumentException("Buffer size must be positive");
+    }
     this.bufferSize = bufferSize;
 }
```
677-744: Well-implemented lazy loading stream with proper resource management

The `S3LazyInputStream` implementation correctly handles:

- Chunked downloads using range requests
- Temporary file management
- Stream lifecycle
- Error handling

A few suggestions for improvement:

- Consider adding a buffer size parameter check in the constructor
- Consider implementing `read(byte[], int, int)` for better performance
- It might be beneficial to add some logging for debugging purposes

Here's how you might implement the optimized read method:

```java
@Override
public int read(byte[] b, int off, int len) throws IOException {
    // is the current chunk completely read and others are available?
    if (currPos == endOfChunk && currPos < fileSize) {
        currentChunkStream.close();
        downloadChunk();
    }
    if (currPos >= endOfChunk) {
        return -1;
    }
    int bytesToRead = (int) Math.min(len, endOfChunk - currPos);
    int bytesRead = currentChunkStream.read(b, off, bytesToRead);
    if (bytesRead > 0) {
        currPos += bytesRead;
    } else if (bytesRead == -1) {
        currentChunkStream.close();
    }
    return bytesRead;
}
```
717-735: Ensure proper cleanup of temporary files on errors

While the code does clean up the temporary file on exception, consider using Java's try-with-resources to ensure cleanup in all scenarios.

```diff
 private void downloadChunk() throws IOException, FileNotFoundException {
     // Create a DownloadFileRequest with the desired byte range
     long startByte = currPos; // Start byte (inclusive)
     long endByte = Long.min(startByte + chunkMaxSize - 1, fileSize - 1); // End byte (inclusive)
     GetObjectRequest getRequest = new GetObjectRequest(bucketName, objectKey)
             .withRange(startByte, endByte);

     File currentChunkFile = File.createTempFile("s3-disk-copy-" + UUID.randomUUID(), "temp");
     currentChunkFile.deleteOnExit();
-    try {
+    try (FileOutputStream ignored = new FileOutputStream(currentChunkFile)) {
         Download download = tm.download(getRequest, currentChunkFile);
         download.waitForCompletion();
         currentChunkStream = new DeleteOnCloseFileInputStream(currentChunkFile);
         endOfChunk = endOfChunk + download.getProgress().getBytesTransferred();
     } catch (AmazonClientException | InterruptedException e) {
         currentChunkFile.delete();
         throw new IOException(e);
     }
 }
```

dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (1)
1-467: Consider adding error tests for chunked downloads

The test coverage for the happy path is excellent, but consider adding tests for error scenarios like:
- Network interruption during chunk download
- S3 returning errors for specific chunks
- Testing with very large files or invalid ranges
This would ensure the error handling in the lazy loading implementation is robust.
🧰 Tools
🪛 ast-grep (0.31.1)
[warning] 420-420: Detected MD5 hash algorithm which is considered insecure. MD5 is not collision resistant and is therefore not suitable as a cryptographic signature. Use HMAC instead.
Context: "MD5"
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures(use-of-md5-java)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (4 hunks)
- dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: dspace-dependencies / docker-build (linux/amd64, ubuntu-latest, true)
- GitHub Check: Run Unit Tests
- GitHub Check: Run Integration Tests
🔇 Additional comments (9)
dspace-api/src/main/java/org/dspace/storage/bitstore/S3BitStoreService.java (3)
110-113: Good configuration approach for chunk size

The default 5MB buffer size for S3 chunks is a reasonable choice, aligning with AWS's recommended chunk size for multipart operations. The clear documentation explains the purpose of this configuration property.
298-298: Excellent refactoring to implement lazy-loading

The change from eager full download to lazy-loading with chunked streaming is a significant improvement for handling large files efficiently.
729-729: Details

✅ Verification successful

Need to check for DeleteOnCloseFileInputStream availability

This code references the `DeleteOnCloseFileInputStream` class, which isn't defined in this file. Ensure this class is available or include its implementation.

🏁 Script executed:

```shell
#!/bin/bash
# Check if DeleteOnCloseFileInputStream is defined somewhere in the codebase
rg "class DeleteOnCloseFileInputStream"
```

Length of output: 196

DeleteOnCloseFileInputStream class verified

`DeleteOnCloseFileInputStream` is defined in the codebase at:

- dspace-api/src/main/java/org/dspace/storage/bitstore/DeleteOnCloseFileInputStream.java
No further action required.
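For readers unfamiliar with that class, its behavior can be approximated in a few lines. This is a simplified, hypothetical analogue of the delete-on-close pattern (not DSpace's actual implementation): the temp file backing a downloaded chunk is removed as soon as its stream is closed, so finished chunks never linger on disk.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

public class DeleteOnCloseDemo {

    // Minimal sketch: a FileInputStream that deletes its file on close().
    static class DeleteOnCloseStream extends FileInputStream {
        private final File file;

        DeleteOnCloseStream(File file) throws IOException {
            super(file);
            this.file = file;
        }

        @Override
        public void close() throws IOException {
            try {
                super.close();
            } finally {
                file.delete();  // best-effort cleanup, mirroring the PR's intent
            }
        }
    }

    // Creates a temp file, reads it through the stream, and reports whether
    // the file still exists after close (it should not).
    public static boolean fileSurvivesClose() {
        try {
            File tmp = File.createTempFile("s3-chunk-demo", ".tmp");
            Files.write(tmp.toPath(), new byte[] {1, 2, 3});
            try (DeleteOnCloseStream in = new DeleteOnCloseStream(tmp)) {
                in.read();  // file still exists while the stream is open
            }
            return tmp.exists();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fileSurvivesClose() ? "file leaked" : "file deleted");
    }
}
```

Combined with `deleteOnExit()` in `downloadChunk()`, this gives two layers of cleanup: immediate deletion on close in the normal path, and JVM-exit deletion as a fallback.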
dspace-api/src/test/java/org/dspace/storage/bitstore/S3BitStoreServiceIT.java (6)
85-86: Good approach using ConfigurationService for tests

Using the actual ConfigurationService instead of mocking improves test reliability by verifying the real integration behavior.
100-102: Nice test setup for buffer size

Setting a small buffer size (22 bytes) for tests is a great approach to verify chunked download with manageable test data sizes.
133-140: Comprehensive test coverage for chunking scenarios

The test data includes excellent coverage cases:
- Content smaller than a chunk
- Content spanning across chunk boundary
- Content fitting exactly two chunks
- Content spanning more than two chunks
This ensures the chunking logic handles all edge cases correctly.
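With the test buffer size of 22 bytes, the number of ranged GETs each of those cases triggers is just ceiling division of the content size by the chunk size. A small helper (hypothetical, not part of the PR or its tests) makes the expected request counts explicit:

```java
public class ChunkMath {

    // Number of range requests needed to stream fileSize bytes in chunks
    // of chunkSize bytes: ceil(fileSize / chunkSize).
    static long chunksNeeded(long fileSize, long chunkSize) {
        if (fileSize == 0) {
            return 0;
        }
        return (fileSize + chunkSize - 1) / chunkSize;  // ceiling division
    }

    public static void main(String[] args) {
        long chunk = 22;  // buffer size used by the integration tests
        System.out.println(chunksNeeded(10, chunk));  // smaller than a chunk -> 1
        System.out.println(chunksNeeded(30, chunk));  // spans one chunk boundary -> 2
        System.out.println(chunksNeeded(44, chunk));  // exactly two chunks -> 2
        System.out.println(chunksNeeded(50, chunk));  // more than two chunks -> 3
    }
}
```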
143-146: Good test execution for multiple content sizes

Testing with various content sizes ensures the chunking mechanism works properly across different scenarios.
150-164: Excellent refactoring with checkGetPut helper method

The extracted helper method improves code readability and maintainability while ensuring consistent verification across all test cases.
408-417: Great test for configuration-driven initialization

This test properly verifies that the service respects the `assetstore.s3.enabled` configuration property and doesn't initialize when disabled.
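The gating pattern that test exercises can be sketched without any DSpace or AWS dependencies. In this sketch, `java.util.Properties` stands in for DSpace's ConfigurationService and `S3StoreStub` for the real service; only the property name `assetstore.s3.enabled` is taken from the PR, everything else is illustrative.

```java
import java.util.Properties;

public class ConfigGateDemo {

    // Hypothetical stub of a config-gated store service.
    static class S3StoreStub {
        private final Properties config;
        boolean initialized = false;

        S3StoreStub(Properties config) {
            this.config = config;
        }

        void init() {
            // Skip all S3 client setup when the store is disabled.
            if (!Boolean.parseBoolean(config.getProperty("assetstore.s3.enabled", "false"))) {
                return;
            }
            initialized = true;  // the real service would build its S3 client here
        }
    }

    public static void main(String[] args) {
        Properties off = new Properties();
        off.setProperty("assetstore.s3.enabled", "false");
        S3StoreStub disabled = new S3StoreStub(off);
        disabled.init();

        Properties on = new Properties();
        on.setProperty("assetstore.s3.enabled", "true");
        S3StoreStub enabled = new S3StoreStub(on);
        enabled.init();

        System.out.println(disabled.initialized + " " + enabled.initialized);  // prints "false true"
    }
}
```

Defaulting the property to `false` keeps the service inert unless explicitly enabled, which matches the behavior the test asserts.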
```java
@Override
public int read() throws IOException {
    // is the current chunk completely read and other are available?
    if (currPos == endOfChunk && currPos < fileSize) {
```
This condition is (at least) superfluous:

`currPos == endOfChunk && currPos < fileSize`

The code is cherry-picked from the upstream.
Problem description
Check the issue.
Summary by CodeRabbit

- New Features
- Bug Fixes
- Tests