[#1573] feat(spark-connector): Spark regression test system (#3933)

### What changes were proposed in this pull request?

Add a Spark regression test system to run regression tests for Spark SQL. It contains three groups of SQL files:

- hive, contains the SQLs in https://datastrato.ai/docs/0.5.1/spark-connector/spark-catalog-hive
- iceberg, contains the SQLs in https://datastrato.ai/docs/0.5.1/spark-connector/spark-catalog-iceberg
- tpcds, contains all queries; the data is generated at about scale 0.01 to reduce its size.

### Why are the changes needed?

Fix: #1573

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added integration tests (IT).
apache · Aug 2, 2024 · f288d91
1 parent e8cdae3
commit f288d91

Showing 252 changed files with 471,973 additions and 0 deletions.
diff --git a/build.gradle.kts b/build.gradle.kts
@@ -472,6 +472,7 @@ tasks.rat {
     "integration-test/**/*.sql",
     "integration-test/**/*.txt",
     "docs/**/*.md",
+    "spark-connector/spark-common/src/test/resources/**",
     "web/.**",
     "web/next-env.d.ts",
     "web/dist/**/*",

diff --git a/docs/spark-connector/spark-integration-test.md b/docs/spark-connector/spark-integration-test.md
@@ -0,0 +1,46 @@
+---
+title: "Apache Gravitino Spark connector integration test"
+slug: /spark-connector/spark-connector-integration-test
+keyword: spark connector integration test
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+There are two types of integration tests in the Spark connector: normal integration tests like `SparkXXCatalogIT`, and golden file integration tests.
+
+## Normal integration test
+
+Normal integration tests are mainly used to test the correctness of the metadata and are enabled in the GitHub CI. You can run a test with a specific Spark version like:
+
+```
+./gradlew :spark-connector:spark-3.3:test --tests "org.apache.gravitino.spark.connector.integration.test.hive.SparkHiveCatalogIT33.testCreateHiveFormatPartitionTable"
+```
+
+## Golden file integration test
+
+Golden file integration tests mainly test the correctness of SQL results with massive data. They are disabled in the GitHub CI; you can run them with the following command:
+
+```
+./gradlew :spark-connector:spark-3.3:test --tests "org.apache.gravitino.spark.connector.integration.test.sql.SparkSQLRegressionTest33" -PenableSparkSQLITs
+```
+
+Please change the Spark version number if you want to test other Spark versions.
+If you want to change the test behaviour, please modify `spark-connector/spark-common/src/test/resources/spark-test.conf`.
+
+| Configuration item                         | Description                                                                                                                                                                            | Default value                                        | Required | Since Version |
+|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------|----------|---------------|
+| `gravitino.spark.test.dir`                 | The Spark SQL test base dir, including `test-sqls` and `data`.                                                                                                                         | `spark-connector/spark-common/src/test/resources/`   | No       | 0.6.0         |
+| `gravitino.spark.test.sqls`                | The test SQLs to run: a directory selects a group of SQLs like `test-sqls/hive`, a file path selects one SQL like `test-sqls/hive/basic.sql`; use `,` to separate multiple entries     | run all SQLs                                         | No       | 0.6.0         |
+| `gravitino.spark.test.generateGoldenFiles` | Whether to generate golden files, which are used to check the correctness of the SQL result                                                                                            | `false`                                              | No       | 0.6.0         |
+| `gravitino.spark.test.metalake`            | The metalake name to run the test                                                                                                                                                      | `test`                                               | No       | 0.6.0         |
+| `gravitino.spark.test.setupEnv`            | Whether to set up Gravitino and Hive environment                                                                                                                                       | `false`                                              | No       | 0.6.0         |
+| `gravitino.spark.test.uri`                 | Gravitino URI address, only available when `gravitino.spark.test.setupEnv` is false                                                                                                    | http://127.0.0.1:8090                                | No       | 0.6.0         |
+| `gravitino.spark.test.iceberg.warehouse`   | The warehouse location, only available when `gravitino.spark.test.setupEnv` is false                                                                                                   | hdfs://127.0.0.1:9000/user/hive/warehouse-spark-test | No       | 0.6.0         |
+
+The test SQL files are located in `spark-connector/spark-common/src/test/resources/` by default. There are three directories:
+- `hive`, SQL tests for Hive catalog.
+- `lakehouse-iceberg`, SQL tests for Iceberg catalog.
+- `tpcds`, SQL tests for `tpcds` in Hive catalog.
+
+You can create a simple SQL file, like `hive/catalog.sql`, and the program will check its output against `hive/catalog.sql.out`. For complex cases like `tpcds`, you can do some preparation work, such as creating tables and loading data, in `prepare.sql`.
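
The golden file check described in the documentation above can be sketched as follows. This is a minimal illustration, not the code from this commit: the class `GoldenFileCheckDemo` and its methods are hypothetical names, and comparing trimmed text is an assumed comparison rule. Only the mapping from `hive/catalog.sql` to `hive/catalog.sql.out` comes from the documentation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the commit.
class GoldenFileCheckDemo {

  // From the docs: the golden file of hive/catalog.sql is hive/catalog.sql.out.
  static Path goldenFileFor(Path sqlFile) {
    return sqlFile.resolveSibling(sqlFile.getFileName() + ".out");
  }

  // Assumption: two results match when they are equal after trimming.
  static boolean sameResult(String expected, String actual) {
    return expected.trim().equals(actual.trim());
  }

  // Reads the golden file next to the SQL file and compares it with the actual output.
  static boolean matchesGolden(Path sqlFile, String actualOutput) {
    try {
      byte[] bytes = Files.readAllBytes(goldenFileFor(sqlFile));
      return sameResult(new String(bytes, StandardCharsets.UTF_8), actualOutput);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Prints the golden file path derived from the SQL file path.
    System.out.println(goldenFileFor(Paths.get("hive", "catalog.sql")));
  }
}
```

Keeping the golden file next to its SQL file means a new test case only needs the `.sql` file plus one run with `gravitino.spark.test.generateGoldenFiles` enabled to produce the expected output.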
diff --git a/spark-connector/spark-common/build.gradle.kts b/spark-connector/spark-common/build.gradle.kts
@@ -155,6 +155,14 @@ tasks.clean {
   delete("spark-warehouse")
 }
 
+sourceSets {
+  named("test") {
+    resources {
+      exclude("**/*")
+    }
+  }
+}
+
 val testJar by tasks.registering(Jar::class) {
   archiveClassifier.set("tests")
   from(sourceSets["test"].output)

diff --git a/.../src/test/java/org/apache/gravitino/spark/connector/integration/test/sql/CatalogType.java b/.../src/test/java/org/apache/gravitino/spark/connector/integration/test/sql/CatalogType.java
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.gravitino.spark.connector.integration.test.sql;
+
+import java.util.Locale;
+
+public enum CatalogType {
+  HIVE,
+  ICEBERG,
+  UNKNOWN;
+
+  public static CatalogType fromString(String str) {
+    if (str == null) {
+      return UNKNOWN;
+    }
+    switch (str.toLowerCase(Locale.ROOT)) {
+      case "hive":
+        return HIVE;
+      case "lakehouse-iceberg":
+        return ICEBERG;
+      default:
+        return UNKNOWN;
+    }
+  }
+
+  // The first non-unknown CatalogType from parent to child determines the catalog type.
+  public static CatalogType merge(CatalogType parentCatalogType, CatalogType childCatalogType) {
+    return parentCatalogType.equals(UNKNOWN) ? childCatalogType : parentCatalogType;
+  }
+}
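
The "first non-unknown type from parent to child wins" rule of `merge` can be illustrated with a short sketch. The enum body below mirrors the diff above. The `resolve` helper, the class name `CatalogTypeMergeDemo`, and the sample directory names `nested` and `misc` are hypothetical, and folding `merge` over the directory names of a path is an assumption about how a caller might use it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; only the enum logic mirrors the diff above.
class CatalogTypeMergeDemo {
  enum CatalogType {
    HIVE,
    ICEBERG,
    UNKNOWN;

    static CatalogType fromString(String str) {
      if (str == null) {
        return UNKNOWN;
      }
      switch (str.toLowerCase(Locale.ROOT)) {
        case "hive":
          return HIVE;
        case "lakehouse-iceberg":
          return ICEBERG;
        default:
          return UNKNOWN;
      }
    }

    // The parent wins unless it is UNKNOWN.
    static CatalogType merge(CatalogType parent, CatalogType child) {
      return parent.equals(UNKNOWN) ? child : parent;
    }
  }

  // Assumption: a caller resolves the type by folding merge() over the
  // directory names of a path, from the root down.
  static CatalogType resolve(List<String> pathSegments) {
    CatalogType result = CatalogType.UNKNOWN;
    for (String segment : pathSegments) {
      result = CatalogType.merge(result, CatalogType.fromString(segment));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(resolve(Arrays.asList("test-sqls", "hive", "nested"))); // HIVE
    System.out.println(resolve(Arrays.asList("test-sqls", "lakehouse-iceberg", "hive"))); // ICEBERG
    System.out.println(resolve(Arrays.asList("test-sqls", "misc"))); // UNKNOWN
  }
}
```

Because the parent takes precedence, a directory that already determines a catalog type is not overridden by the name of a subdirectory.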
diff --git a/.../src/test/java/org/apache/gravitino/spark/connector/integration/test/sql/QueryOutput.java b/.../src/test/java/org/apache/gravitino/spark/connector/integration/test/sql/QueryOutput.java
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.gravitino.spark.connector.integration.test.sql;
+
+import lombok.Getter;
+
+/** The SQL execution output, including the schema and the output. */
+@Getter
+public final class QueryOutput {
+  private final String sql;
+  private final String schema;
+  private final String output;
+
+  public QueryOutput(String sql, String schema, String output) {
+    this.sql = sql;
+    this.schema = schema;
+    this.output = output;
+  }
+
+  @Override
+  public String toString() {
+    return "-- !query\n"
+        + sql
+        + "\n"
+        + "-- !query schema\n"
+        + schema
+        + "\n"
+        + "-- !query output\n"
+        + output;
+  }
+}
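
To make the text layout produced by `toString` concrete, here is a small sketch that renders one query block. The `render` method repeats the concatenation from the diff without the Lombok dependency. The class name `QueryOutputDemo` and the sample SQL, schema, and output values are illustrative assumptions, not content taken from the commit's golden files.

```java
// Hypothetical demo class; render() mirrors QueryOutput#toString from the diff.
class QueryOutputDemo {

  static String render(String sql, String schema, String output) {
    return "-- !query\n"
        + sql
        + "\n"
        + "-- !query schema\n"
        + schema
        + "\n"
        + "-- !query output\n"
        + output;
  }

  public static void main(String[] args) {
    // Sample values chosen for illustration only.
    System.out.println(render("SELECT 1 AS id", "struct<id:int>", "1"));
    // Prints:
    // -- !query
    // SELECT 1 AS id
    // -- !query schema
    // struct<id:int>
    // -- !query output
    // 1
  }
}
```

The three `-- !query` marker lines delimit the SQL text, its schema, and its result.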
diff --git a/...st/java/org/apache/gravitino/spark/connector/integration/test/sql/SQLQueryTestHelper.java b/...st/java/org/apache/gravitino/spark/connector/integration/test/sql/SQLQueryTestHelper.java
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.gravitino.spark.connector.integration.test.sql;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.function.Supplier;
+import java.util.stream.Collectors;
+import org.apache.commons.lang3.tuple.Pair;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.execution.HiveResult;
+import org.apache.spark.sql.execution.SQLExecution;
+import org.apache.spark.sql.types.StructType;
+import scala.Option;
+import scala.collection.JavaConverters;
+
+public class SQLQueryTestHelper {
+  private static final String notIncludedMsg = "[not included in comparison]";
+  private static final String clsName = SQLQueryTestHelper.class.getCanonicalName();
+  private static final String emptySchema = new StructType().catalogString();
+
+  private static String replaceNotIncludedMsg(String line) {
+    line =
+        line.replaceAll("#\\d+", "#x")
+            .replaceAll("plan_id=\\d+", "plan_id=x")
+            .replaceAll(
+                "Location.*" + clsName + "/", "Location " + notIncludedMsg + "/{warehouse_dir}/")
+            .replaceAll("file:[^\\s,]*" + clsName, "file:" + notIncludedMsg + "/{warehouse_dir}")
+            .replaceAll("Created By.*", "Created By " + notIncludedMsg)
+            .replaceAll("Created Time.*", "Created Time " + notIncludedMsg)
+            .replaceAll("Last Access.*", "Last Access " + notIncludedMsg)
+            .replaceAll("Partition Statistics\t\\d+", "Partition Statistics\t" + notIncludedMsg)
+            .replaceAll("\\s+$", "")
+            .replaceAll("\\*\\(\\d+\\) ", "*");
+    return line;
+  }
+
+  public static Pair<String, List<String>> getNormalizedResult(SparkSession session, String sql) {
+    Dataset<Row> df = session.sql(sql);
+    String schema = df.schema().catalogString();
+    List<String> answer =
+        SQLExecution.withNewExecutionId(
+            df.queryExecution(),
+            Option.apply(""),
+            () ->
+                JavaConverters.seqAsJavaList(
+                        HiveResult.hiveResultString(df.queryExecution().executedPlan()))
+                    .stream()
+                    .map(s -> replaceNotIncludedMsg(s))
+                    .filter(s -> !s.isEmpty())
+                    .collect(Collectors.toList()));
+
+    Collections.sort(answer);
+
+    return Pair.of(schema, answer);
+  }
+
+  // Different Spark versions may produce different exceptions, so any failure is reported as the
+  // fixed placeholder [SPARK_EXCEPTION].
+  public static Pair<String, List<String>> handleExceptions(
+      Supplier<Pair<String, List<String>>> result) {
+    try {
+      return result.get();
+    } catch (Throwable e) {
+      return Pair.of(emptySchema, Arrays.asList("[SPARK_EXCEPTION]"));
+    }
+  }
+}
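
The effect of the normalization in `replaceNotIncludedMsg` and the sorting in `getNormalizedResult` can be tried without a Spark session. The sketch below applies a subset of the regular expressions from the diff (the two that depend on the class name are left out) to hand-written sample lines. The class `NormalizeDemo` and the sample inputs are hypothetical and only resemble Spark plan or `DESCRIBE` output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; the regexes are copied from SQLQueryTestHelper above.
class NormalizeDemo {
  private static final String NOT_INCLUDED = "[not included in comparison]";

  // Replaces run-specific details (expression ids, plan ids, timestamps)
  // with stable placeholders and strips trailing whitespace.
  static String normalize(String line) {
    return line.replaceAll("#\\d+", "#x")
        .replaceAll("plan_id=\\d+", "plan_id=x")
        .replaceAll("Created By.*", "Created By " + NOT_INCLUDED)
        .replaceAll("Created Time.*", "Created Time " + NOT_INCLUDED)
        .replaceAll("\\s+$", "")
        .replaceAll("\\*\\(\\d+\\) ", "*");
  }

  // Normalizes every row, drops empty rows, and sorts the rest, so the
  // comparison does not depend on the row order of the result.
  static List<String> normalizeAll(List<String> rows) {
    return rows.stream()
        .map(NormalizeDemo::normalize)
        .filter(s -> !s.isEmpty())
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(normalize("*(2) Project [id#123]")); // *Project [id#x]
    System.out.println(normalize("Created Time\tFri Aug 02 10:00:00 UTC 2024"));
    // Created Time [not included in comparison]
    System.out.println(normalizeAll(Arrays.asList("b  ", "", "a#1"))); // [a#x, b]
  }
}
```

Sorting makes the comparison order-insensitive, which also means a golden file cannot verify the row order of a query with `ORDER BY`.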