feat(java): brings transaction api to Java module and support project#4219
feat(java): brings transaction api to Java module and support project#4219jackye1995 merged 7 commits intolance-format:mainfrom
Conversation
47a3ff1 to
ed42933
Compare
| uuid, | ||
| operation: op, | ||
| blobs_op, | ||
| tag: message, |
There was a problem hiding this comment.
tag should be described as tag also in Java, not a new term message
There was a problem hiding this comment.
my conern is that when I set a tag and a dataset into one transaction. how to make people know the tag IS NOT version related and it actually acts like a transaction extra label?
if I understand wrong,please correct me.
There was a problem hiding this comment.
Yeah this is a confusing part indeed. I think we can start with what Python does, just omit the tag field for now so it is not exposed to end users. Even if we want to commit with a message, as we discussed previously, we want to do that as a part of the Transaction properties workstream instead, so we should not hijack the tag field for that purpose. What do you think?
There was a problem hiding this comment.
Yeah this is a confusing part indeed. I think we can start with what Python does, just omit the
tagfield for now so it is not exposed to end users. Even if we want to commit with a message, as we discussed previously, we want to do that as a part of the Transaction properties workstream instead, so we should not hijack the tag field for that purpose. What do you think?
Sure, I'm OK with this approach
| * @param message The description of the operation. | ||
| * @return A new instance of {@link Transaction} linked to the opened dataset. | ||
| */ | ||
| public Transaction newTransaction(String message) { |
There was a problem hiding this comment.
why expose a specific one with message? We can just expose a
public Transaction.Builder newTransactionBuilder() {
return new Transaction.Builder(this);
}
There was a problem hiding this comment.
I think what I aim to do is directly get a transaction obj by newTransaction. Then the usage could be simple.
Your suggestion seems more flexible. I will follow it
There was a problem hiding this comment.
why expose a specific one with message? We can just expose a
public Transaction.Builder newTransactionBuilder() { return new Transaction.Builder(this); }
As discussed above, there's no need to expose the api with message.
I think a newTransaction is enough for now. For further properties usage, I think we could just:
txn = committedDataset.newTransaction()
.newProject(new Schema(fieldList))
.setProperties(.....)
.commit()
This could be concise
WDYT @jackye1995
There was a problem hiding this comment.
.newProject(new Schema(fieldList))
what does it mean to have a project for a transaction?
There was a problem hiding this comment.
oh nvm it's the Project operation, I misunderstood
There was a problem hiding this comment.
And I don't think newXXXX() should be one-one with operation.
For example, right now update_meta is one opereation including updating table config, replacing schema config and replacing field configs. It could be used as
dataset.newTransaction()
.newUpdateConfig(xxx)
.newReplaceSchemaConfig(xxx)
.newReplaceFieldsConfig(xxxx)
.commit()
When calling like this, the underlying is one operation.
People don't need to know how to construct operations. They could be hidden and Transaction api could provide the better calling way to construct them.
There was a problem hiding this comment.
I would suggest, let's just do the multi-statement case implementation here to see how things work end to end, have 2 different operations and commit them. It would be hard to imagine how things extend to multi-statement with just this PR.
You mean implement more opreations and see the usage in uni-test maybe?
There was a problem hiding this comment.
Let's see if we can simply what you wrote. Firstly there is only one operation per transaction you don't need a list. Secondly, secondly, we can add syntax sugar to say dataset.newCommit().addTxn(...).addTxn(...).execute. Would that be better?
There was a problem hiding this comment.
Let's see if we can simply what you wrote. Firstly there is only one operation per transaction you don't need a list. Secondly, secondly, we can add syntax sugar to say dataset.newCommit().addTxn(...).addTxn(...).execute. Would that be better?
I think better than the build() storm code.
- dataset.newCommit().execute() is one transaction or multi? I mean the ACID granularity. I think what we want is one execute() one ACID which means one transaction? but what you are calling is addTxn and andTx. That could make things confusing.
- The key differenece from mine(maybe we could call Iceberg like?) I think is use Commit to replace Transaction, use transaction to replace operation(which I think cause semantics confusing). and let callers build operations themself. I assume the key point is to make operations more exposed to upper modules for flexibilities concern? If so I think we can discuss the risks that transaction wraps operations. Or we just want to strictly align with rust style as one goal? Or be different from Iceberg(seriously what I am afraid is to be not as good as Iceberg in this case not be like it :()
There was a problem hiding this comment.
Let's see if we can simply what you wrote. Firstly there is only one operation per transaction you don't need a list. Secondly, secondly, we can add syntax sugar to say dataset.newCommit().addTxn(...).addTxn(...).execute. Would that be better?
One more issue is the read version, addTxn() more than once could introduce more than one read version, which should not be involved in multi-statement trasaction. We could check and throw exceptions. But I think the whole thing is just redundant and unnecessary.
java/core/src/main/java/com/lancedb/lance/operation/Projection.java
Outdated
Show resolved
Hide resolved
java/core/src/main/java/com/lancedb/lance/operation/Project.java
Outdated
Show resolved
Hide resolved
java/core/src/main/java/com/lancedb/lance/operation/Project.java
Outdated
Show resolved
Hide resolved
1254b0f to
d2d0ce0
Compare
What's our next step? I'm ready for this newCommit and addTransaction interface. I think eventually they do the same things except the priority is concept consistency or simplicity. I agree that the former takes priority |
| return writeParams; | ||
| } | ||
|
|
||
| public Transaction newProject(Schema schema) { |
There was a problem hiding this comment.
Given that we have not fully decided how we willl chain multiple transaction, I feel it is better that we don't do a syntax like this to give an illusion that we can "chain transactions" (We can always add later).
Also typically with a builder people expect the object to be immutable after build, and a newXXX to create a new object that inherits some stats of the source object. But here Transaction txn = dataset.newTransactionBuilder().build().newProject(new Schema(fieldList)); we are actually mutating the state of the already built transaction, that is also a bit confusing.
As a simple first step integration, can we just in the builder do:
Transaction txn = dataset.newTransactionBuilder()
.operation(Project.builder().schema(schema).allocator(dataset.allocator).build())
.build()what do you think?
There was a problem hiding this comment.
I think this is resonable. I think there's a note for this case:
Transaction txn = dataset.newTransactionBuilder()
.operation(Project.builder(dataset).schema(schema).build())
.build()
- Avoid users or engines to set their own allocator or create a new one(avoid danging allocators)
- Make the dataset a neligable paramter to build the project operation
- The dataset now is necessary for JNI usage
I think this situation may suite some other operations cases. WDYT
There was a problem hiding this comment.
oh,never mind
The dataset is necessary for transaction, not operation. Project did need an allocator. but it didn't release.
I think We'd better give it a managed allocator(do not expose,just pass transaction dataset allocator). Or create a new one and release it after commit(creating allocator is heavy as said).
What do you prefer?
There was a problem hiding this comment.
I think We'd better give it a managed allocator(do not expose,just pass transaction dataset allocator).
+1
|
looks like some tests are faling, let me know when that is fixed I can go merge it! |
Close #4201