Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark] Identity Columns APIs in DeltaColumnBuilder #2857

Merged
merged 12 commits into from
Apr 30, 2024

Conversation

c27kwan
Copy link
Contributor

@c27kwan c27kwan commented Apr 8, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR is part of #1959

  • We introduce generateAlwaysAsIdentity and generatedByDefaultAsIdentityAPIs into DeltaColumnBuilder so that users can create Delta table with Identity column.
  • We guard the creation of identity column tables with a feature flag until development is complete.

How was this patch tested?

New tests.

Does this PR introduce any user-facing changes?

Yes, we introduce generateAlwaysAsIdentity and generatedByDefaultAsIdentity interfaces to DeltaColumnBuilder for creating identity columns.
Interfaces

def generatedAlwaysAsIdentity(): DeltaColumnBuilder
def generatedAlwaysAsIdentity(start: Long, step: Long): DeltaColumnBuilder
def generatedByDefaultAsIdentity(): DeltaColumnBuilder
def generatedByDefaultAsIdentity(start: Long, step: Long): DeltaColumnBuilder

When the start and the step parameters are not specified, they default to 1L. generatedByDefaultAsIdentity allows users to insert values into the column while a column specified withgeneratedAlwaysAsIdentity can only ever have system generated values.

Example Usage

// Creates a Delta identity column.
io.delta.tables.DeltaTable.columnBuilder(spark, "id")
      .dataType(LongType)
      .generatedAlwaysAsIdentity()
// Which is equivalent to the call
io.delta.tables.DeltaTable.columnBuilder(spark, "id")
      .dataType(LongType)
      .generatedAlwaysAsIdentity(start = 1L, step = 1L)

@c27kwan c27kwan changed the title [Spark] Identity Column Scala apis + flag [Spark] Identity Column APIs in DeltaColumnBuilder + TableFeature Apr 8, 2024
@c27kwan
Copy link
Contributor Author

c27kwan commented Apr 8, 2024

I split the table feature part into a separate PR, which will need to be merged first: #2859

@c27kwan c27kwan changed the title [Spark] Identity Column APIs in DeltaColumnBuilder + TableFeature [Spark] Identity Columns APIs in DeltaColumnBuilder Apr 8, 2024
Copy link
Contributor

@larsk-db larsk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

*
* @param start the start value of the identity column
* @param step the increment step of the identity column
* @since 3.2.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, glad to see this PR so quickly. Are you planning to release it in 3.2.0?
Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a stretch goal at the moment, since the release for Delta 3.2 is quite soon. I should know by end of week whether I can realistically make 3.2. I plan to leave this PR up for at least one week anyways. :)

It's safe to start merging the code even if we can only make 4.0 since the DeltaSQLConf guards creating identity columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, this PR is on the critical path for identity columns implementation in Delta 3.2. We need an interface to specify identity columns. The SQL syntax can only be introduced in the next Spark release (Spark 4), which means we can only have it for Delta 4.0 earliest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Just for my knowledge, the required changes in Spark 4 are only to support SQL syntax, meaning Scala could be supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's part of what I'm assessing this week in my prototype. To put the UDF for generating identity column values in only the Delta repo requires adding workarounds so that it works on Spark 3.5. Once I have enough confidence about the idea, I'll post a PR.

Everything else doesn't seem to require Spark changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to make Delta 3.2. I've discussed with @tdas offline and we're aiming for the next Delta version, Delta 3.3. The API doc comment in this PR has been adjusted accordingly.

}

def identityColumnDataTypeNotSupported(): Throwable = {
new AnalysisException("Identity column does not support this data type.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to tell the user what is the supported type and the unsupported type :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and it seems the other existing error message don't provide the list of supported types. It would be more consistent to not report the supported types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's OK.

@c27kwan c27kwan requested a review from ala April 15, 2024 15:03
@c27kwan c27kwan force-pushed the scala-api branch 2 times, most recently from e1fafec to 8ab57a0 Compare April 15, 2024 15:20
Copy link
Contributor

@ala ala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Thank you!

@allisonport-db allisonport-db merged commit 44b76fa into delta-io:master Apr 30, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants