
Spark - Add Spark FunctionCatalog #5377

Merged

Conversation

@kbendick (Contributor) commented Jul 28, 2022

This PR stems from #5305 and covers just the FunctionCatalog

FunctionCatalog

This allows users of SparkCatalog and SparkSessionCatalog to use functions (such as the iceberg_version function added here) without having to register them as UDFs.

The v2 functions also benefit from code generation and are significantly more efficient. For instance, this resolves to a logical plan that simply projects the constant "0.15.0-SNAPSHOT" as value.

All Iceberg functions that we register in the function catalog are accessible when used with an Iceberg Spark catalog and either:

  1. No namespace is referenced - the storage partitioned joins implementation requires this, e.g. my_catalog.iceberg_version().
  2. The system namespace is referenced, to match the syntax used for calling procedures.
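The two accepted namespace forms above can be sketched as a small predicate (a simplified, illustrative sketch; the class and method names here are hypothetical, not Iceberg's actual implementation):

```java
// Illustrative sketch of the namespace rule described above: an Iceberg
// function resolves when no namespace is referenced, or when the single
// "system" namespace is referenced. Not Iceberg's real code.
public class FunctionNamespaceCheck {
  public static boolean isFunctionNamespace(String[] namespace) {
    // Case 1: no namespace referenced, e.g. my_catalog.iceberg_version()
    if (namespace.length == 0) {
      return true;
    }
    // Case 2: the "system" namespace, matching procedure call syntax
    return namespace.length == 1 && "system".equals(namespace[0]);
  }
}
```

In Spark's actual FunctionCatalog API the namespace arrives as a String[] (e.g. in listFunctions), which is why the sketch takes an array; it only captures the acceptance rule, not the catalog wiring.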

**Note**: The session catalog (SparkSessionCatalog, normally named spark_catalog) requires that the namespace being referenced exists.

For the session catalog, the current namespace (the default if none has been set) will be referenced, which will not resolve. To work around this when using the session catalog in SQL, two options are:

  1. Referencing spark_catalog.iceberg_version()
  2. Calling system.iceberg_version(). This requires creating a system namespace in the session catalog, but it is the most portable solution for SQL code.

Spark's Analyzer contains logic to verify that the namespace exists when the function is resolved, but only for the session catalog.

iceberg_version function

This also adds a simple iceberg_version function, which returns the (short) version string. It is mostly for testing but will be useful on its own.

The github-actions bot added the spark label Jul 28, 2022
@kbendick changed the title from "Spark - Add Spark FunctionCatalog with Basic iceberg_version Function for Testing" to "Spark - Add Spark FunctionCatalog" Jul 28, 2022
@kbendick force-pushed the kb-add-spark-function-catalog-only-no-truncate branch 2 times, most recently from 692c07e to 257163d on July 29, 2022 19:22
@kbendick force-pushed the kb-add-spark-function-catalog-only-no-truncate branch 2 times, most recently from ce113f4 to e1054c2 on July 29, 2022 23:43
@aokolnychyi (Contributor) left a comment:

Looks good to me. I had a few minor suggestions. Thanks for working on this, @kbendick!

@kbendick force-pushed the kb-add-spark-function-catalog-only-no-truncate branch 4 times, most recently from 086ef01 to 2e9651a on July 30, 2022 02:46
@kbendick (Contributor Author) replied:

> Looks good to me. I had a few minor suggestions. Thanks for working on this, @kbendick!

Thanks for the review @aokolnychyi! I made those changes and rebased. This will hopefully help unblock parallelizing work on storage partitioned joins and things.

    .anyMatch(func -> "iceberg_version".equalsIgnoreCase(func.name()));

Assertions.assertThat(asFunctionCatalog.listFunctions(SYSTEM_NAMESPACE))
    .anyMatch(func -> "iceberg_version".equalsIgnoreCase(func.name()));
A reviewer (Contributor) commented:

Nit: I would not be permissive here. This lets Spark change the case of the function name, which we don't expect and would be really odd.

@kbendick (Contributor Author) replied Jul 31, 2022:

I chose this because function resolution is case-insensitive in Spark, at least according to the existing code for procedures as well as my own investigation. This way, it also matches our current ProcedureCatalog callables. I can update it though.

public static ProcedureBuilder newBuilder(String name) {
  // procedure resolution is case insensitive to match the existing Spark behavior for functions
  Supplier<ProcedureBuilder> builderSupplier = BUILDERS.get(name.toLowerCase(Locale.ROOT));
  return builderSupplier != null ? builderSupplier.get() : null;
}
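The lowercase-key lookup pattern above can be shown as a self-contained sketch (illustrative names and a hypothetical entry; Iceberg's real map holds ProcedureBuilder suppliers):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Self-contained sketch of the lookup pattern above: keys are stored in
// lowercase, and lookups normalize the name with Locale.ROOT, so resolution
// is case-insensitive regardless of how the caller spells the name.
public class CaseInsensitiveRegistry {
  private static final Map<String, Supplier<String>> BUILDERS = new HashMap<>();

  static {
    // Hypothetical entry standing in for a registered function/procedure builder
    BUILDERS.put("iceberg_version", () -> "0.15.0-SNAPSHOT");
  }

  public static String resolve(String name) {
    Supplier<String> supplier = BUILDERS.get(name.toLowerCase(Locale.ROOT));
    return supplier != null ? supplier.get() : null;
  }
}
```

Locale.ROOT avoids surprises from locale-specific case mappings (e.g. the Turkish dotless i), which is why it is used instead of the default locale.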

@kbendick (Contributor Author) commented Jul 31, 2022:

I updated the tests, but built-in functions are case-insensitive by default.

Tests done on Spark 3.3 with spark.sql.caseSensitive set to both true and false (made no difference either way).

scala> spark.sql("SELECT uuid() as _uuid").show()
+--------------------+
|               _uuid|
+--------------------+
|3d52c2c7-225c-44c...|
+--------------------+


scala> spark.sql("SELECT UUID() as _uuid").show()
+--------------------+
|               _uuid|
+--------------------+
|1babdfb6-1f71-498...|
+--------------------+


scala> spark.sql("SELECT UuID() as _uuid").show()
+--------------------+
|               _uuid|
+--------------------+
|d63892a1-e2e6-49d...|
+--------------------+

A reviewer (Contributor) commented:

Yeah, I think Spark functions are case insensitive no matter what.

A reviewer (Contributor) replied:

> I updated the tests, but built-in functions are case-insensitive by default.

That doesn't mean that Spark will change the case of names that are supplied by FunctionCatalog. The problem isn't resolution being case insensitive, it is that this test case allows Spark to change the function name. I don't see a reason to allow that.
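The stricter check being asked for can be sketched in a self-contained way (illustrative names; not the committed test code): comparing with equals rather than equalsIgnoreCase means a function name whose case was altered by Spark would fail the assertion instead of being silently accepted.

```java
import java.util.List;

// Sketch of the stricter assertion: exact string equality catches a
// case change in the function name, where equalsIgnoreCase would not.
public class ExactNameCheck {
  public static boolean containsExact(List<String> functionNames, String expected) {
    return functionNames.stream().anyMatch(expected::equals);
  }
}
```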

@kbendick force-pushed the kb-add-spark-function-catalog-only-no-truncate branch 2 times, most recently from bdce642 to cdfbbc6 on July 31, 2022 22:17
@kbendick force-pushed the kb-add-spark-function-catalog-only-no-truncate branch from cdfbbc6 to 9369e78 on July 31, 2022 22:24
@kbendick (Contributor Author) commented:

@rdblue I addressed your most recent comments. Can you please take a look?

@rdblue merged commit aaa67d0 into apache:master Aug 1, 2022
@rdblue (Contributor) commented Aug 1, 2022:

Merged. @kbendick, can you backport this to 3.2 as well?

@kbendick (Contributor Author) commented Aug 1, 2022:

> Merged. @kbendick, can you backport this to 3.2 as well?

Working on it now!

@kbendick deleted the kb-add-spark-function-catalog-only-no-truncate branch August 1, 2022 19:07
@kbendick (Contributor Author) commented Aug 1, 2022:

New PR for Spark 3.2: #5411

I'll ping people on it once unit tests have finished passing.

abmo-x pushed a commit to abmo-x/iceberg that referenced this pull request Aug 2, 2022