feat(java): introduce overwrite and append in transaction#4327
feat(java): introduce overwrite and append in transaction#4327jackye1995 merged 12 commits intolance-format:mainfrom
Conversation
|
@jackye1995 Hi, This PR is ready for review. A little refactoring on transaction for better generality |
| private long readVersion; | ||
| private Operation operation; | ||
| private Operation blobOp; | ||
| private Operation.Builder<?> operationBuilder; |
There was a problem hiding this comment.
I guess this also works, but trying to understand why we prefer this compared to the original approach of building the operation and then feed to the transaction builder? Seems like we have to introduce <?> here to force ignore the type information and then check it a t runtime like in
if (operationBuilder instanceof Overwrite.Builder) {
operationBuilder =
((Overwrite.Builder) operationBuilder).configUpsertValues(upsertTableConfig);
}
if the whole goal is just to save the logic of things like upserts, it does not feel worth to me this complication
There was a problem hiding this comment.
The whole picture I was thinking is like:
Transaction tx = dataset.newTransactionBuilder()
.upsertTableConfig(..)
.deleteTableConfigKeys(..)
.replaceSchemaConfig(..)
.replaceFieldConfig(..)
.build()
or just
Transaction tx = dataset.newTransactionBuilder()
.upsertTableConfig(..)
.build()
Transaction tx = dataset.newTransactionBuilder()
.overwrite(..)
.upsertTableConfig(..)
.build()
or just
Transaction tx = dataset.newTransactionBuilder()
.overwrite(..)
.build()
I was thinking upsertTableConfig has nothing to do with things like overwrite or replaceSchemaConfig except they could be done in one operation. From upper view, they are not very necessary to be exposed and coupled in one. I thougt this would be more flexible and friendly for callers and make things optional.
There are also some issues like if we call like:
Transaction tx = dataset.newTransactionBuilder()
.upsertTableConfig(..)
.overwrite(..)
.build()
It will throw an multi operation error since upsertTableConfig() will initialize an UpdateConfig operation.
I thought this would be a little bit fussy. I was intending to reveal the idea and discuss it. If feeling not right. Let's just use:
Transaction tx = dataset.newTransactionBuilder()
.updateConfig(xx, xx, xx, xx)
.build()
or just
Transaction tx = dataset.newTransactionBuilder()
.upsertTableConfig(xx, null, null, null)
.build()
Transaction tx = dataset.newTransactionBuilder()
.overwrite(xx, xx)
.build()
or just
Transaction tx = dataset.newTransactionBuilder()
.overwrite(xx, null)
.build()
What do you think?
There was a problem hiding this comment.
I was thinking about something even simpler (at least in my mind), something like
dataset.newTransactionBuilder()
.operation(Overwrite.builder().fragments(...).schema(...).configUpsertValues(...).build())
.build()
.commit()
dataset.newTransactionBuilder()
.operation(Append.builder().fragments(...).build())
.build()
.commit()And when we do multi-transaction, we will do
dataset.newCommitBuilder()
.addTransaction(dataset.newTransactionBuilder()
.operation(Overwrite.builder().fragments(...).schema(...).configUpsertValues(...).build())
.build())
.addTransaction(dataset.newTransactionBuilder()
.operation(Append.builder().fragments(...).build())
.build())
.build()
.commit()There was a problem hiding this comment.
@jackye1995 Hi, I did some refactoring. I believe the interface is clean now. PTAL.
BTW, I organized the logic between allocator and dataset
# Conflicts: # java/core/src/main/java/com/lancedb/lance/Transaction.java # java/core/src/test/java/com/lancedb/lance/TransactionTest.java
| import org.apache.arrow.vector.types.pojo.Schema; | ||
|
|
||
| /** Schema related base operation. */ | ||
| public abstract class SchemaOperation implements Operation { |
There was a problem hiding this comment.
I am usually on the side of no need to optimize too much and create a base class if it's just to save a few lines of code, but I am not too opinionated on that either, up to you.
There was a problem hiding this comment.
The main reason for this SchemaOperation definition is to be used in rust jni(shared by operations that have schema):
fn convert_schema_from_operation(
env: &mut JNIEnv,
java_operation: &JObject,
java_dataset: &JObject,
) -> Result<LanceSchema> {
let java_buffer_allocator = env
.call_method(
java_dataset,
"allocator",
"()Lorg/apache/arrow/memory/BufferAllocator;",
&[],
)?
.l()?;
let schema_ptr = env
.call_method(
java_operation,
"exportSchema",
"(Lorg/apache/arrow/memory/BufferAllocator;)J",
&[JValue::Object(&java_buffer_allocator)],
)?
.j()?;
let c_schema_ptr = schema_ptr as *mut FFI_ArrowSchema;
let c_schema = unsafe { FFI_ArrowSchema::from_raw(c_schema_ptr) };
let schema = Schema::try_from(&c_schema)?;
Ok(
LanceSchema::try_from(&schema)
.expect("Failed to convert from arrow schema to lance schema"),
)
}
I was thinking if the operations doesn't have a common interface, the reused callMethod would not be that safe. And I intended to avoid warnings when we don't reuse this.
Options here:
- Just eliminate the abstract class, leaving them implement
exportSchemaeach. - Make SchemaOperation an interface.
- Keep this
I‘m not really opinionated on this neither. I wrote this in case that you didn't notice the rust code. I'm happy to hear how you are feeling about this.
jackye1995
left a comment
There was a problem hiding this comment.
overall looks good to me
Close #4330
For java binding, there are interfaces in Dataset.java to append and overwrite fragments:
In Rust and Python, append and overwrite operations are uniformly handled through the transaction interface. Java should adopt an equivalent implementation by consolidating these patterns into its transaction framework.
This pull request will:
commit(),commitAsync())