Skip to content

feat(java): brings transaction api to Java module and support project#4219

Merged
jackye1995 merged 7 commits intolance-format:mainfrom
majin1102:transaction_project
Jul 24, 2025
Merged

feat(java): brings transaction api to Java module and support project#4219
jackye1995 merged 7 commits intolance-format:mainfrom
majin1102:transaction_project

Conversation

@majin1102
Copy link
Contributor

Close #4201

@github-actions github-actions bot added enhancement New feature or request java labels Jul 14, 2025
@majin1102 majin1102 force-pushed the transaction_project branch 4 times, most recently from 47a3ff1 to ed42933 Compare July 14, 2025 07:49
uuid,
operation: op,
blobs_op,
tag: message,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tag should be described as tag also in Java, not a new term message

Copy link
Contributor Author

@majin1102 majin1102 Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my conern is that when I set a tag and a dataset into one transaction. how to make people know the tag IS NOT version related and it actually acts like a transaction extra label?
if I understand wrong,please correct me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a confusing part indeed. I think we can start with what Python does, just omit the tag field for now so it is not exposed to end users. Even if we want to commit with a message, as we discussed previously, we want to do that as a part of the Transaction properties workstream instead, so we should not hijack the tag field for that purpose. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a confusing part indeed. I think we can start with what Python does, just omit the tag field for now so it is not exposed to end users. Even if we want to commit with a message, as we discussed previously, we want to do that as a part of the Transaction properties workstream instead, so we should not hijack the tag field for that purpose. What do you think?

Sure, I'm OK with this approach

* @param message The description of the operation.
* @return A new instance of {@link Transaction} linked to the opened dataset.
*/
public Transaction newTransaction(String message) {
Copy link
Contributor

@jackye1995 jackye1995 Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why expose a specific one with message? We can just expose a

public Transaction.Builder newTransactionBuilder() {
    return new Transaction.Builder(this);
  }

Copy link
Contributor Author

@majin1102 majin1102 Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I aim to do is directly get a transaction obj by newTransaction. Then the usage could be simple.

Your suggestion seems more flexible. I will follow it

Copy link
Contributor Author

@majin1102 majin1102 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why expose a specific one with message? We can just expose a

public Transaction.Builder newTransactionBuilder() {
    return new Transaction.Builder(this);
  }

As discussed above, there's no need to expose the api with message.
I think a newTransaction is enough for now. For further properties usage, I think we could just:

 txn = committedDataset.newTransaction()
        .newProject(new Schema(fieldList))
        .setProperties(.....)
        .commit()

This could be concise
WDYT @jackye1995

Copy link
Contributor

@jackye1995 jackye1995 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.newProject(new Schema(fieldList))

what does it mean to have a project for a transaction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nvm it's the Project operation, I misunderstood

Copy link
Contributor Author

@majin1102 majin1102 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I don't think newXXXX() should be one-one with operation.

For example, right now update_meta is one opereation including updating table config, replacing schema config and replacing field configs. It could be used as

dataset.newTransaction()
     .newUpdateConfig(xxx)
     .newReplaceSchemaConfig(xxx)
     .newReplaceFieldsConfig(xxxx)
     .commit()

When calling like this, the underlying is one operation.

People don't need to know how to construct operations. They could be hidden and Transaction api could provide the better calling way to construct them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest, let's just do the multi-statement case implementation here to see how things work end to end, have 2 different operations and commit them. It would be hard to imagine how things extend to multi-statement with just this PR.

You mean implement more opreations and see the usage in uni-test maybe?

Copy link
Contributor

@jackye1995 jackye1995 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can simply what you wrote. Firstly there is only one operation per transaction you don't need a list. Secondly, secondly, we can add syntax sugar to say dataset.newCommit().addTxn(...).addTxn(...).execute. Would that be better?

Copy link
Contributor Author

@majin1102 majin1102 Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can simply what you wrote. Firstly there is only one operation per transaction you don't need a list. Secondly, secondly, we can add syntax sugar to say dataset.newCommit().addTxn(...).addTxn(...).execute. Would that be better?

I think better than the build() storm code.

  1. dataset.newCommit().execute() is one transaction or multi? I mean the ACID granularity. I think what we want is one execute() one ACID which means one transaction? but what you are calling is addTxn and andTx. That could make things confusing.
  2. The key differenece from mine(maybe we could call Iceberg like?) I think is use Commit to replace Transaction, use transaction to replace operation(which I think cause semantics confusing). and let callers build operations themself. I assume the key point is to make operations more exposed to upper modules for flexibilities concern? If so I think we can discuss the risks that transaction wraps operations. Or we just want to strictly align with rust style as one goal? Or be different from Iceberg(seriously what I am afraid is to be not as good as Iceberg in this case not be like it :()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can simply what you wrote. Firstly there is only one operation per transaction you don't need a list. Secondly, secondly, we can add syntax sugar to say dataset.newCommit().addTxn(...).addTxn(...).execute. Would that be better?

One more issue is the read version, addTxn() more than once could introduce more than one read version, which should not be involved in multi-statement trasaction. We could check and throw exceptions. But I think the whole thing is just redundant and unnecessary.

@majin1102 majin1102 force-pushed the transaction_project branch from 1254b0f to d2d0ce0 Compare July 19, 2025 09:34
@majin1102
Copy link
Contributor Author

majin1102 commented Jul 19, 2025

Let's see if we can simply what you wrote. Firstly there is only one operation per transaction you don't need a list. Secondly, secondly, we can add syntax sugar to say dataset.newCommit().addTxn(...).addTxn(...).execute. Would that be better?

What's our next step? I'm ready for this newCommit and addTransaction interface. I think eventually they do the same things except the priority is concept consistency or simplicity. I agree that the former takes priority

return writeParams;
}

public Transaction newProject(Schema schema) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we have not fully decided how we willl chain multiple transaction, I feel it is better that we don't do a syntax like this to give an illusion that we can "chain transactions" (We can always add later).

Also typically with a builder people expect the object to be immutable after build, and a newXXX to create a new object that inherits some stats of the source object. But here Transaction txn = dataset.newTransactionBuilder().build().newProject(new Schema(fieldList)); we are actually mutating the state of the already built transaction, that is also a bit confusing.

As a simple first step integration, can we just in the builder do:

Transaction txn = dataset.newTransactionBuilder()
.operation(Project.builder().schema(schema).allocator(dataset.allocator).build())
.build()

what do you think?

Copy link
Contributor Author

@majin1102 majin1102 Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is resonable. I think there's a note for this case:

Transaction txn = dataset.newTransactionBuilder()
.operation(Project.builder(dataset).schema(schema).build())
.build()
  1. Avoid users or engines to set their own allocator or create a new one(avoid danging allocators)
  2. Make the dataset a neligable paramter to build the project operation
  3. The dataset now is necessary for JNI usage

I think this situation may suite some other operations cases. WDYT

Copy link
Contributor Author

@majin1102 majin1102 Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh,never mind

The dataset is necessary for transaction, not operation. Project did need an allocator. but it didn't release.

I think We'd better give it a managed allocator(do not expose,just pass transaction dataset allocator). Or create a new one and release it after commit(creating allocator is heavy as said).

What do you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think We'd better give it a managed allocator(do not expose,just pass transaction dataset allocator).

+1

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

@jackye1995
Copy link
Contributor

looks like some tests are faling, let me know when that is fixed I can go merge it!

@jackye1995 jackye1995 merged commit bdfb8fc into lance-format:main Jul 24, 2025
8 checks passed
@majin1102 majin1102 deleted the transaction_project branch September 10, 2025 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Brings transaction api for java module

2 participants