
Conversation


@carloea2 carloea2 commented Jan 6, 2026

What changes were proposed in this PR?

  • Enforce the single_file_upload_max_size_mib limit for multipart uploads at init by requiring fileSizeBytes + partSizeBytes and rejecting when the total declared file size exceeds the configured max.
  • Persist multipart sizing metadata in DB by adding file_size_bytes and part_size_bytes to dataset_upload_session, plus constraints to keep them valid.
  • Harden uploadPart against size bypasses by computing the expected part size from the stored session metadata and rejecting any request whose Content-Length does not exactly match the expected size (including the final part).
  • Add a final server-side safety check at finish: after lakeFS reports the completed object size, compare to the max and roll back the object if it exceeds the limit.
  • Update frontend init call to pass fileSizeBytes and partSizeBytes when initializing multipart uploads.
  • Add DB migration (sql/updates/18.sql) to apply the schema change on existing deployments.
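
A minimal sketch of the init-time limit described in the first bullet; the object and method names here are illustrative, not the PR's actual code:

```scala
// Illustrative sketch: reject a multipart-upload init whose declared total
// size exceeds the configured single_file_upload_max_size_mib limit.
// InitSizeCheck/exceedsMax are hypothetical names.
object InitSizeCheck {
  def exceedsMax(declaredFileSizeBytes: Long, maxSizeMib: Long): Boolean = {
    val maxBytes = maxSizeMib * 1024L * 1024L // MiB -> bytes
    declaredFileSizeBytes > maxBytes          // strictly greater: equal passes
  }
}
```

Under this reading of "exceeds", a declared size exactly equal to the limit is accepted, matching the over/equals boundary cases mentioned in the tests below.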

Any related issues, documentation, discussions?

Closes #4147

How was this PR tested?

  • Added/updated unit tests for multipart upload validation and malicious cases, including:

    • max upload size enforced at init (over/equals boundaries + 2-part boundary)
    • header poisoning and Content-Length mismatch rejection (non-numeric/overflow/mismatch)
    • finish rollback when max is tightened before finish (oversized object must not remain accessible)
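
The part-size validation these tests exercise can be sketched as follows; this is an illustrative reconstruction, not the PR's exact code. Every part except the last must be exactly `partSizeBytes`, and the final part carries the remainder:

```scala
// Illustrative sketch: compute the expected Content-Length of part
// `partNumber` (1-based) out of `numParts` from the stored session
// metadata. A request whose Content-Length differs would be rejected.
object PartSize {
  def expectedPartSize(fileSizeBytes: Long,
                       partSizeBytes: Long,
                       numParts: Int,
                       partNumber: Int): Long =
    if (partNumber < numParts) partSizeBytes
    else fileSizeBytes - partSizeBytes * (numParts.toLong - 1L)
}
```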

Was this PR authored or co-authored using generative AI tooling?

Co-authored-by: ChatGPT

@github-actions github-actions bot added ddl-change Changes to the TexeraDB DDL fix frontend Changes related to the frontend GUI service labels Jan 6, 2026
@carloea2 carloea2 changed the title v1 fix(dataset): enforce max file size for multipart upload Jan 6, 2026

carloea2 commented Jan 6, 2026

@xuang7 @aicam @chenlica

@chenlica chenlica requested a review from aicam January 6, 2026 04:44

chenlica commented Jan 6, 2026

@xuang7 Please be the first reviewer before @aicam can do it.


@xuang7 xuang7 left a comment


Thanks for the PR. LGTM!

val prefixBytes: Long = partSizeBytesValue * nMinus1
if (prefixBytes > fileSizeBytesValue) {
  throw new WebApplicationException(
    "Upload session is inconsistent (prefixBytes > fileSizeBytes). Re-init the upload.",
Contributor

For error messages, consider simplifying them or using “restart the upload” instead of “re-init the upload” for consistency. It may be more intuitive.

Contributor Author

Sounds good, thanks.

Contributor Author

I have changed them; can you let me know if they are still too complex?

Contributor Author

Thanks

@aicam aicam left a comment

I think the life cycle of upload-session records needs a better design; if needed, we can meet.

val fileSizeBytesValue: Long = session.getFileSizeBytes
val partSizeBytesValue: Long = session.getPartSizeBytes

if (fileSizeBytesValue <= 0L) {
Contributor

After a failed upload, the record should be deleted from the database. I suggest moving the error-catching logic into a function so that, if any check fails, the database records are removed there; this way you write the recycling logic once.
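
The suggestion could be sketched like this; `withSessionCleanup` and `deleteSessionRecords` are hypothetical names, not code from the PR:

```scala
// Illustrative sketch: run all validation checks inside one wrapper that
// removes the upload-session records if any check throws, so the
// recycling logic exists in a single place.
object SessionCleanup {
  def withSessionCleanup[A](deleteSessionRecords: () => Unit)(checks: => A): A =
    try checks
    catch {
      case e: Throwable =>
        deleteSessionRecords() // recycle DB state once, on any failure
        throw e
    }
}
```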

Contributor Author

  1. Since this is the endpoint for uploading a part, no record is created in this endpoint at the moment, so no record should be deleted here on error; the current logic relies on all part rows being created during the init phase.

Contributor Author

  1. Regarding the refactor, do you mean each check that throws an exception should have its own function?

case e: DataAccessException
    if Option(e.getCause)
      .collect { case s: SQLException => s.getSQLState }
      .contains("55P03") =>
Contributor

Replace 55P03 with a named constant or something more meaningful.
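
For reference, 55P03 is PostgreSQL's lock_not_available SQLSTATE; a named constant could look like this (`PgSqlStates` is a hypothetical name):

```scala
// Illustrative sketch: name the PostgreSQL SQLSTATE instead of matching the
// raw "55P03" string inline. 55P03 = lock_not_available.
object PgSqlStates {
  val LockNotAvailable: String = "55P03"
}
```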

Contributor Author

I suggest we do that in another PR, since that code was not introduced by this one.

try LakeFSStorageClient.parsePhysicalAddress(physicalAddr)
catch {
  case e: IllegalArgumentException =>
    throw new WebApplicationException(
Contributor

In initMultipartUpload, validateAndNormalizeFilePathOrThrow already ensures the physical address is correct. Why do we check again here? This is duplicate logic.

Contributor Author

My reasoning was that, since file-path validation is a very important security measure, the check should happen here as well even if it feels repetitive. Please confirm whether you still want it removed.

did: Integer,
encodedFilePath: String,
numParts: Optional[Integer],
fileSizeBytes: Optional[java.lang.Long],
Contributor

Why does this parameter receive java.lang.Long when it is then converted to Scala on line 1487? I think it's better to receive a Scala Long in the first place.
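
The conversion under discussion might look like this; `toScalaLong` is a hypothetical helper, not the PR's actual code:

```scala
import java.util.Optional

// Illustrative sketch: unbox an Optional[java.lang.Long] (as bound at the
// resource boundary) into an idiomatic Scala Option[Long] immediately.
object Params {
  def toScalaLong(opt: Optional[java.lang.Long]): Option[Long] =
    if (opt.isPresent) Some(opt.get.longValue) else None
}
```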

Contributor Author

Agree

Contributor Author

Done



Development

Successfully merging this pull request may close these issues.

Multipart dataset upload can bypass single_file_upload_max_size_mib limit
