feat(java): introduce overwrite and append in transaction by majin1102 · Pull Request #4327 · lance-format/lance

majin1102 · 2025-07-28T16:32:09Z

Close #4330
For the Java binding, Dataset.java exposes interfaces to append and overwrite fragments:

  public static native Dataset commitAppend(
      String path,
      Optional<Long> readVersion,
      List<FragmentMetadata> fragmentsMetadata,
      Map<String, String> storageOptions);

  public static native Dataset commitOverwrite(
      String path,
      long arrowSchemaMemoryAddress,
      Optional<Long> readVersion,
      List<FragmentMetadata> fragmentsMetadata,
      Map<String, String> storageOptions);

In Rust and Python, append and overwrite operations are uniformly handled through the transaction interface. Java should adopt an equivalent implementation by consolidating these patterns into its transaction framework.

This pull request will:

Deprecate the legacy commit methods (commit(), commitAsync())
Migrate these two operations to the transaction interface

majin1102 · 2025-07-29T04:06:53Z

@jackye1995 Hi, this PR is ready for review.

I also did a little refactoring on the transaction for better generality.

jackye1995 · 2025-07-29T22:37:56Z

java/core/src/main/java/com/lancedb/lance/Transaction.java

    private long readVersion;
-    private Operation operation;
-    private Operation blobOp;
+    private Operation.Builder<?> operationBuilder;


I guess this also works, but I am trying to understand why we prefer this over the original approach of building the operation and then feeding it to the transaction builder. It seems we have to introduce <?> here to discard the type information and then check it at runtime, as in:

    if (operationBuilder instanceof Overwrite.Builder) {
      operationBuilder = ((Overwrite.Builder) operationBuilder).configUpsertValues(upsertTableConfig);
    }

If the whole goal is just to share the logic for things like upserts, the complication does not feel worth it to me.
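To make the trade-off in this comment concrete, here is a minimal Java sketch. All types in it (SketchOperation, SketchOverwrite, HoldsBuilder, HoldsOperation) are hypothetical stand-ins, not the Lance classes. It only illustrates that a field typed as Builder<?> needs an instanceof check and a cast before operation-specific configuration, while a builder that receives an already built operation does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not the Lance API.
interface SketchOperation {
  String name();

  interface Builder<T extends SketchOperation> {
    T build();
  }
}

final class SketchOverwrite implements SketchOperation {
  final Map<String, String> configUpsertValues;

  private SketchOverwrite(Map<String, String> configUpsertValues) {
    this.configUpsertValues = configUpsertValues;
  }

  @Override
  public String name() {
    return "Overwrite";
  }

  static final class Builder implements SketchOperation.Builder<SketchOverwrite> {
    private final Map<String, String> values = new HashMap<>();

    Builder configUpsertValues(Map<String, String> upserts) {
      values.putAll(upserts);
      return this;
    }

    @Override
    public SketchOverwrite build() {
      return new SketchOverwrite(new HashMap<>(values));
    }
  }
}

// Approach under review: the transaction builder keeps a wildcard-typed builder.
final class HoldsBuilder {
  private SketchOperation.Builder<?> operationBuilder;

  HoldsBuilder operationBuilder(SketchOperation.Builder<?> builder) {
    this.operationBuilder = builder;
    return this;
  }

  HoldsBuilder upsertTableConfig(Map<String, String> upserts) {
    // The wildcard hides the concrete type, so a runtime check and a cast are needed.
    if (operationBuilder instanceof SketchOverwrite.Builder) {
      operationBuilder = ((SketchOverwrite.Builder) operationBuilder).configUpsertValues(upserts);
    }
    return this;
  }

  SketchOperation build() {
    return operationBuilder.build();
  }
}

// Original approach: the caller builds the operation and hands it over fully typed.
final class HoldsOperation {
  private SketchOperation operation;

  HoldsOperation operation(SketchOperation op) {
    this.operation = op;
    return this;
  }

  SketchOperation build() {
    return operation;
  }
}
```

In the second shape, upsert values are set on the typed Overwrite builder before the operation reaches the transaction builder, so no type information is lost.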

The whole picture I had in mind is like:

    Transaction tx = dataset.newTransactionBuilder()
        .upsertTableConfig(..)
        .deleteTableConfigKeys(..)
        .replaceSchemaConfig(..)
        .replaceFieldConfig(..)
        .build()

or just

    Transaction tx = dataset.newTransactionBuilder()
        .upsertTableConfig(..)
        .build()

    Transaction tx = dataset.newTransactionBuilder()
        .overwrite(..)
        .upsertTableConfig(..)
        .build()

or just

    Transaction tx = dataset.newTransactionBuilder()
        .overwrite(..)
        .build()

I was thinking that upsertTableConfig has nothing to do with things like overwrite or replaceSchemaConfig, except that they can be done in one operation. From a higher-level view, they do not need to be exposed and coupled together. I thought this would be more flexible and friendlier for callers, and would make things optional.

There are also some issues. For example, if we call:

    Transaction tx = dataset.newTransactionBuilder()
        .upsertTableConfig(..)
        .overwrite(..)
        .build()

it will throw a multi-operation error, since upsertTableConfig() will initialize an UpdateConfig operation.
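The order sensitivity described here can be sketched as follows. This is a simplified model under stated assumptions, not the PR's implementation: OrderSensitiveBuilder is a hypothetical class, operations are represented by their names only, and the exception type is an arbitrary choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a builder that allows one operation per transaction.
final class OrderSensitiveBuilder {
  private String operation; // name of the single operation, null until one is chosen
  private final Map<String, String> upserts = new HashMap<>();

  OrderSensitiveBuilder overwrite() {
    if (operation != null) {
      // An operation was already initialized, e.g. by upsertTableConfig().
      throw new IllegalStateException("multiple operations: " + operation + " and Overwrite");
    }
    operation = "Overwrite";
    return this;
  }

  OrderSensitiveBuilder upsertTableConfig(Map<String, String> values) {
    if (operation == null) {
      // With no operation chosen yet, the config call eagerly becomes UpdateConfig.
      operation = "UpdateConfig";
    }
    // Otherwise the values are attached to the operation that is already set.
    upserts.putAll(values);
    return this;
  }

  String build() {
    if (operation == null) {
      throw new IllegalStateException("no operation set");
    }
    return operation;
  }
}
```

In this model, overwrite() followed by upsertTableConfig() builds an Overwrite, while the reverse order fails in overwrite(), which is the surprise for callers that the comment points out.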

I thought this would be a little bit fussy. I intended to lay out the idea and discuss it. If it does not feel right, let's just use:

    Transaction tx = dataset.newTransactionBuilder()
        .updateConfig(xx, xx, xx, xx)
        .build()

or just

    Transaction tx = dataset.newTransactionBuilder()
        .upsertTableConfig(xx, null, null, null)
        .build()

    Transaction tx = dataset.newTransactionBuilder()
        .overwrite(xx, xx)
        .build()

or just

    Transaction tx = dataset.newTransactionBuilder()
        .overwrite(xx, null)
        .build()

What do you think?

I was thinking about something even simpler (at least in my mind), something like

    dataset.newTransactionBuilder()
        .operation(Overwrite.builder().fragments(...).schema(...).configUpsertValues(...).build())
        .build()
        .commit()

    dataset.newTransactionBuilder()
        .operation(Append.builder().fragments(...).build())
        .build()
        .commit()

And when we do multi-transaction, we will do

    dataset.newCommitBuilder()
        .addTransaction(dataset.newTransactionBuilder()
            .operation(Overwrite.builder().fragments(...).schema(...).configUpsertValues(...).build())
            .build())
        .addTransaction(dataset.newTransactionBuilder()
            .operation(Append.builder().fragments(...).build())
            .build())
        .build()
        .commit()
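A self-contained sketch of the shape proposed above: a transaction holds exactly one caller-built operation, and a commit groups several transactions. Every type here (SketchOp, SketchAppend, SketchOverwriteOp, SketchTransaction, SketchCommit) is a hypothetical stand-in, fragments and schemas are plain strings, and nothing is actually committed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not the Lance classes.
interface SketchOp {
  String name();
}

final class SketchAppend implements SketchOp {
  final List<String> fragments;

  SketchAppend(List<String> fragments) {
    this.fragments = fragments;
  }

  @Override
  public String name() {
    return "Append";
  }
}

final class SketchOverwriteOp implements SketchOp {
  final List<String> fragments;
  final String schema;

  SketchOverwriteOp(List<String> fragments, String schema) {
    this.fragments = fragments;
    this.schema = schema;
  }

  @Override
  public String name() {
    return "Overwrite";
  }
}

// A transaction carries exactly one operation, already built by the caller.
final class SketchTransaction {
  final SketchOp operation;

  private SketchTransaction(SketchOp operation) {
    this.operation = operation;
  }

  static final class Builder {
    private SketchOp operation;

    Builder operation(SketchOp op) {
      this.operation = op;
      return this;
    }

    SketchTransaction build() {
      if (operation == null) {
        throw new IllegalStateException("a transaction needs an operation");
      }
      return new SketchTransaction(operation);
    }
  }
}

// A commit groups several transactions in the order they were added.
final class SketchCommit {
  final List<SketchTransaction> transactions;

  private SketchCommit(List<SketchTransaction> transactions) {
    this.transactions = transactions;
  }

  static final class Builder {
    private final List<SketchTransaction> pending = new ArrayList<>();

    Builder addTransaction(SketchTransaction tx) {
      pending.add(tx);
      return this;
    }

    SketchCommit build() {
      return new SketchCommit(Collections.unmodifiableList(new ArrayList<>(pending)));
    }
  }
}
```

Because each operation is built with its own typed builder, the transaction builder needs no operation-specific methods and no runtime type checks.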

@jackye1995 Hi, I did some refactoring. I believe the interface is clean now. PTAL.

BTW, I also organized the logic between the allocator and the dataset.

jackye1995 · 2025-08-04T21:58:46Z

java/core/src/main/java/com/lancedb/lance/operation/SchemaOperation.java

+import org.apache.arrow.vector.types.pojo.Schema;
+
+/** Schema related base operation. */
+public abstract class SchemaOperation implements Operation {


I usually lean toward not over-optimizing by creating a base class just to save a few lines of code, but I am not too opinionated on that either. Up to you.

The main reason for defining SchemaOperation is that it is used in the Rust JNI code (shared by operations that have a schema):

    fn convert_schema_from_operation(
        env: &mut JNIEnv,
        java_operation: &JObject,
        java_dataset: &JObject,
    ) -> Result<LanceSchema> {
        let java_buffer_allocator = env
            .call_method(
                java_dataset,
                "allocator",
                "()Lorg/apache/arrow/memory/BufferAllocator;",
                &[],
            )?
            .l()?;
        let schema_ptr = env
            .call_method(
                java_operation,
                "exportSchema",
                "(Lorg/apache/arrow/memory/BufferAllocator;)J",
                &[JValue::Object(&java_buffer_allocator)],
            )?
            .j()?;
        let c_schema_ptr = schema_ptr as *mut FFI_ArrowSchema;
        let c_schema = unsafe { FFI_ArrowSchema::from_raw(c_schema_ptr) };
        let schema = Schema::try_from(&c_schema)?;
        Ok(
            LanceSchema::try_from(&schema)
                .expect("Failed to convert from arrow schema to lance schema"),
        )
    }

I was thinking that if the operations don't have a common interface, the reused call_method would not be that safe. I also intended to avoid warnings when we don't reuse this.

Options here:

1. Eliminate the abstract class and have each operation implement exportSchema itself.
2. Make SchemaOperation an interface.
3. Keep this as is.

I'm not really opinionated on this either. I wrote this in case you hadn't noticed the Rust code. I'm happy to hear how you feel about this.
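As a rough illustration of the second option (making SchemaOperation an interface), the sketch below uses stand-in types only: FakeAllocator replaces Arrow's BufferAllocator, a list of field names replaces the Arrow Schema, and SchemaCarrier, OverwriteWithSchema and OtherSchemaOp are hypothetical names. The point is that a default method gives every schema-bearing operation the same exportSchema name and signature without a shared base class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for Arrow's BufferAllocator: hands out fake handles for exported schemas.
final class FakeAllocator {
  private final List<List<String>> exported = new ArrayList<>();

  long export(List<String> schema) {
    exported.add(schema);
    return exported.size(); // fake, non-zero handle; not a real memory address
  }
}

// Sketch of option 2: an interface instead of an abstract base class.
interface SchemaCarrier {
  List<String> schema(); // stand-in for an Arrow Schema

  // One shared method name and signature for every schema-bearing operation.
  default long exportSchema(FakeAllocator allocator) {
    return allocator.export(schema());
  }
}

final class OverwriteWithSchema implements SchemaCarrier {
  private final List<String> fields;

  OverwriteWithSchema(String... fields) {
    this.fields = Arrays.asList(fields);
  }

  @Override
  public List<String> schema() {
    return fields;
  }
}

// A second, hypothetical operation that also carries a schema.
final class OtherSchemaOp implements SchemaCarrier {
  private final List<String> fields;

  OtherSchemaOp(String... fields) {
    this.fields = Arrays.asList(fields);
  }

  @Override
  public List<String> schema() {
    return fields;
  }
}
```

With this shape, native code that looks up exportSchema by name and signature would target the same method on either class, and each operation remains free to extend a different base class.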

jackye1995

overall looks good to me

github-actions bot added the enhancement and java labels Jul 28, 2025

majin1102 force-pushed the txn-overwrite branch from 72bebc5 to 056189b Compare July 28, 2025 16:47

majin1102 marked this pull request as draft July 28, 2025 18:04

majin1102 force-pushed the txn-overwrite branch from 8c4783d to 9d61857 Compare July 28, 2025 18:08

feat(java): introduce overwrite and append in transaction

f1a3319

majin1102 force-pushed the txn-overwrite branch from d9a1aa9 to f1a3319 Compare July 29, 2025 04:02

Merge branch 'main' into txn-overwrite

e674281

majin1102 marked this pull request as ready for review July 29, 2025 04:02

majin1102 mentioned this pull request Jul 29, 2025

Epic: Improving integration with Java ecosystem #3950

Closed

jackye1995 reviewed Jul 29, 2025

majin1102 force-pushed the txn-overwrite branch from 2627c92 to 19dc6b1 Compare July 30, 2025 12:10

optimize interface

e0bbaf7

majin1102 force-pushed the txn-overwrite branch from 19dc6b1 to e0bbaf7 Compare July 30, 2025 12:11

majin1102 added 3 commits July 30, 2025 20:11

Merge branch 'main' into txn-overwrite

dcca9be

use dataset.allocator()

6be3cee

make dataset.allocator = more reasonable

c82ba57

majin1102 mentioned this pull request Jul 30, 2025

Java transaction supports CreateIndex operation #4334

Closed

majin1102 added 4 commits August 1, 2025 00:54

Merge branch 'main' into txn-overwrite

7f2f8c3

Merge remote-tracking branch 'majin/main' into txn-overwrite

ee4063c

# Conflicts: # java/core/src/main/java/com/lancedb/lance/Transaction.java # java/core/src/test/java/com/lancedb/lance/TransactionTest.java

fix issue in merging

47fbc23

cargo fmt --all

1e8d947

majin1102 mentioned this pull request Aug 4, 2025

Java dataset support readTransaction #4382

Closed

jackye1995 reviewed Aug 4, 2025

jackye1995 approved these changes Aug 4, 2025

majin1102 added 2 commits August 5, 2025 14:44

Merge branch 'main' into txn-overwrite

12e24e1

Merge branch 'main' into txn-overwrite

6db40ed

jackye1995 merged commit 6ae7026 into lance-format:main Aug 5, 2025
8 checks passed

majin1102 deleted the txn-overwrite branch September 10, 2025 07:08

jackye1995 merged 12 commits into lance-format:main from majin1102:txn-overwrite

2 participants
