Skip to content

feat(java): introduce overwrite and append in transaction#4327

Merged
jackye1995 merged 12 commits intolance-format:mainfrom
majin1102:txn-overwrite
Aug 5, 2025
Merged

feat(java): introduce overwrite and append in transaction#4327
jackye1995 merged 12 commits intolance-format:mainfrom
majin1102:txn-overwrite

Conversation

@majin1102
Copy link
Contributor

@majin1102 majin1102 commented Jul 28, 2025

Close #4330
For java binding, there are interfaces in Dataset.java to append and overwrite fragments:

  public static native Dataset commitAppend(
      String path,
      Optional<Long> readVersion,
      List<FragmentMetadata> fragmentsMetadata,
      Map<String, String> storageOptions);

  public static native Dataset commitOverwrite(
      String path,
      long arrowSchemaMemoryAddress,
      Optional<Long> readVersion,
      List<FragmentMetadata> fragmentsMetadata,
      Map<String, String> storageOptions);

In Rust and Python, append and overwrite operations are uniformly handled through the transaction interface. Java should adopt an equivalent implementation by consolidating these patterns into its transaction framework.

This pull request will:

  1. Deprecate the legacy commit methods (commit(), commitAsync())
  2. Migrate these two operations to the transaction interface

@github-actions github-actions bot added enhancement New feature or request java labels Jul 28, 2025
@majin1102 majin1102 marked this pull request as draft July 28, 2025 18:04
@majin1102 majin1102 marked this pull request as ready for review July 29, 2025 04:02
@majin1102
Copy link
Contributor Author

@jackye1995 Hi, This PR is ready for review.

A little refactoring on transaction for better generality

private long readVersion;
private Operation operation;
private Operation blobOp;
private Operation.Builder<?> operationBuilder;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this also works, but trying to understand why we prefer this compared to the original approach of building the operation and then feed to the transaction builder? Seems like we have to introduce <?> here to force ignore the type information and then check it a t runtime like in

if (operationBuilder instanceof Overwrite.Builder) {
        operationBuilder =
            ((Overwrite.Builder) operationBuilder).configUpsertValues(upsertTableConfig);
      }

if the whole goal is just to save the logic of things like upserts, it does not feel worth to me this complication

Copy link
Contributor Author

@majin1102 majin1102 Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole picture I was thinking is like:

Transaction tx = dataset.newTransactionBuilder()
    .upsertTableConfig(..)
    .deleteTableConfigKeys(..)
    .replaceSchemaConfig(..)
    .replaceFieldConfig(..)
    .build()
or just
Transaction tx = dataset.newTransactionBuilder()
    .upsertTableConfig(..)
    .build()

Transaction tx = dataset.newTransactionBuilder()
    .overwrite(..)
    .upsertTableConfig(..)
    .build()
or just
Transaction tx = dataset.newTransactionBuilder()
    .overwrite(..)
    .build()

I was thinking upsertTableConfig has nothing to do with things like overwrite or replaceSchemaConfig except they could be done in one operation. From upper view, they are not very necessary to be exposed and coupled in one. I thougt this would be more flexible and friendly for callers and make things optional.

There are also some issues like if we call like:

Transaction tx = dataset.newTransactionBuilder()
    .upsertTableConfig(..)
    .overwrite(..)
    .build()

It will throw an multi operation error since upsertTableConfig() will initialize an UpdateConfig operation.

I thought this would be a little bit fussy. I was intending to reveal the idea and discuss it. If feeling not right. Let's just use:

Transaction tx = dataset.newTransactionBuilder()
    .updateConfig(xx, xx, xx, xx)
    .build()
or just
Transaction tx = dataset.newTransactionBuilder()
    .upsertTableConfig(xx, null, null, null)
    .build()

Transaction tx = dataset.newTransactionBuilder()
    .overwrite(xx, xx)
    .build()
or just
Transaction tx = dataset.newTransactionBuilder()
    .overwrite(xx, null)
    .build()

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something even simpler (at least in my mind), something like

dataset.newTransactionBuilder()
  .operation(Overwrite.builder().fragments(...).schema(...).configUpsertValues(...).build())
  .build()
  .commit()

dataset.newTransactionBuilder()
  .operation(Append.builder().fragments(...).build())
  .build()
  .commit()

And when we do multi-transaction, we will do

dataset.newCommitBuilder()
  .addTransaction(dataset.newTransactionBuilder()
      .operation(Overwrite.builder().fragments(...).schema(...).configUpsertValues(...).build())
      .build())
  .addTransaction(dataset.newTransactionBuilder()
      .operation(Append.builder().fragments(...).build())
      .build())
  .build()
  .commit()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 Hi, I did some refactoring. I believe the interface is clean now. PTAL.

BTW, I organized the logic between allocator and dataset

# Conflicts:
#	java/core/src/main/java/com/lancedb/lance/Transaction.java
#	java/core/src/test/java/com/lancedb/lance/TransactionTest.java
import org.apache.arrow.vector.types.pojo.Schema;

/** Schema related base operation. */
public abstract class SchemaOperation implements Operation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am usually on the side of no need to optimize too much and create a base class if it's just to save a few lines of code, but I am not too opinionated on that either, up to you.

Copy link
Contributor Author

@majin1102 majin1102 Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for this SchemaOperation definition is to be used in rust jni(shared by operations that have schema):

fn convert_schema_from_operation(
    env: &mut JNIEnv,
    java_operation: &JObject,
    java_dataset: &JObject,
) -> Result<LanceSchema> {
    let java_buffer_allocator = env
        .call_method(
            java_dataset,
            "allocator",
            "()Lorg/apache/arrow/memory/BufferAllocator;",
            &[],
        )?
        .l()?;
    let schema_ptr = env
        .call_method(
            java_operation,
            "exportSchema",
            "(Lorg/apache/arrow/memory/BufferAllocator;)J",
            &[JValue::Object(&java_buffer_allocator)],
        )?
        .j()?;
    let c_schema_ptr = schema_ptr as *mut FFI_ArrowSchema;
    let c_schema = unsafe { FFI_ArrowSchema::from_raw(c_schema_ptr) };
    let schema = Schema::try_from(&c_schema)?;
    Ok(
        LanceSchema::try_from(&schema)
            .expect("Failed to convert from arrow schema to lance schema"),
    )
}

I was thinking if the operations doesn't have a common interface, the reused callMethod would not be that safe. And I intended to avoid warnings when we don't reuse this.

Options here:

  1. Just eliminate the abstract class, leaving them implement exportSchema each.
  2. Make SchemaOperation an interface.
  3. Keep this

I‘m not really opinionated on this neither. I wrote this in case that you didn't notice the rust code. I'm happy to hear how you are feeling about this.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good to me

@jackye1995 jackye1995 merged commit 6ae7026 into lance-format:main Aug 5, 2025
8 checks passed
@majin1102 majin1102 deleted the txn-overwrite branch September 10, 2025 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Java transaction supports Overwrite and Append operations

2 participants