Hive: Fix join issues when CBO is enabled #2052

qphien · 2021-01-08T09:54:06Z

After enabling CBO in Hive, joining two Iceberg tables fails on MR. For example:

Cannot find field from inspector
ArrayIndexOutOfBoundsException when getting values from GenericRecord

These issues also occur when an Iceberg table is joined with a non-Iceberg table.

pvary · 2021-01-08T10:49:06Z

mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java

-    if (configuration.get(InputFormatConfig.TABLE_SCHEMA) != null) {
-      tableSchema = SchemaParser.fromJson(configuration.get(InputFormatConfig.TABLE_SCHEMA));
-    } else if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {
+    if (serDeProperties.get(InputFormatConfig.TABLE_SCHEMA) != null) {


Is this needed?
The original intent of the change was that we have the table schema at hand on the mappers/reducers. If we remove this then every mapper/reducer has to read the table once to get the schema.

As shown below:

Each mapper handles an input split that belongs to a single table:
https://github.com/apache/hive/blob/113f6af7528f016bf821f7a746bad496cc93f834/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L406

The function copyTableJobPropertiesToConf copies the 'iceberg.mr.table.schema' property to jobConf:
https://github.com/apache/hive/blob/113f6af7528f016bf821f7a746bad496cc93f834/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2427-L2443

If we join two tables, only one Iceberg schema exists in jobConf, which leads to a wrong inspector:

Empty inspector (the columns selected from the two tables do not overlap), e.g.
SELECT o.order_id, o.customer_id, o.total, p.name FROM default.orders o JOIN default.products p ON o.product_id = p.id ORDER BY o.order_id
selected columns in table default.orders: [order_id, total, customer_id, product_id]
selected columns in table default.products: [id, name]

Incomplete inspector (the columns selected from the two tables overlap), e.g.
SELECT c.first_name, o.order_id FROM default.orders o JOIN default.customers c ON o.customer_id = c.customer_id ORDER BY o.order_id DESC
selected columns in table default.customers: [first_name, customer_id]
selected columns in table default.orders: [order_id, customer_id]
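A minimal sketch of the collision described above, assuming the job configuration behaves like a plain key/value map. The class and helper names are hypothetical, and the schema values are placeholder strings rather than real schema JSON; only the property name 'iceberg.mr.table.schema' comes from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

class SchemaKeyCollisionSketch {
  static final String TABLE_SCHEMA = "iceberg.mr.table.schema";

  // Hypothetical stand-in for copying one table's job properties into the shared job conf.
  static void copyTableProps(Map<String, String> tableProps, Map<String, String> jobConf) {
    jobConf.putAll(tableProps);
  }

  // Copies the properties of two joined tables into a single conf, one after the other.
  static Map<String, String> jobConfAfterJoin() {
    Map<String, String> orders = new HashMap<>();
    orders.put(TABLE_SCHEMA, "orders-schema-json");

    Map<String, String> customers = new HashMap<>();
    customers.put(TABLE_SCHEMA, "customers-schema-json");

    Map<String, String> jobConf = new HashMap<>();
    copyTableProps(orders, jobConf);
    copyTableProps(customers, jobConf); // same key, so the orders schema is overwritten
    return jobConf;
  }

  public static void main(String[] args) {
    System.out.println(jobConfAfterJoin());
  }
}
```

In this model a SerDe that reads the schema from the shared conf sees a single schema, whichever table it was initialized for.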

Thanks for the explanation! This is something @marton-bod has already tried his hand at. We might have to use the table identifier to prefix/postfix the schema...

Hi @marton-bod, is there any thread or ongoing work on the proposal @pvary mentioned above? I cannot find any related issue on GitHub.
Thanks.

Hi @qphien . Indeed, we have had a similar problem before in #1708.
My initial solution to fix it was to add the schema into the config object with a prefix based on the table identifier. For example, when joining together default.orders and default.customers, you'd have two properties in the config: default.orders.iceberg.mr.table.schema and default.customers.iceberg.mr.table.schema. This should allow you to add the schema for multiple tables without collisions/overwrites.

Eventually we ended up not adding in this fix, but instead just reverted the commit that caused the problem for us. However, this prefix solution can be one way to make things work.
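The prefix idea can be sketched as follows, again treating the config as a plain map. The helper names (schemaKey, putSchema, getSchema) are hypothetical and are not part of the Iceberg code base; the key layout follows the example given above.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixedSchemaKeySketch {
  static final String TABLE_SCHEMA = "iceberg.mr.table.schema";

  // Builds a per-table key such as "default.orders.iceberg.mr.table.schema".
  static String schemaKey(String tableIdentifier) {
    return tableIdentifier + "." + TABLE_SCHEMA;
  }

  static void putSchema(Map<String, String> conf, String tableIdentifier, String schemaJson) {
    conf.put(schemaKey(tableIdentifier), schemaJson);
  }

  static String getSchema(Map<String, String> conf, String tableIdentifier) {
    return conf.get(schemaKey(tableIdentifier));
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    putSchema(conf, "default.orders", "orders-schema-json");
    putSchema(conf, "default.customers", "customers-schema-json");
    System.out.println(getSchema(conf, "default.orders"));
    System.out.println(getSchema(conf, "default.customers"));
  }
}
```

Because each table writes to its own key, the order in which the properties are copied no longer matters.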

I think getting the table schema from serDeProperties has the same effect as the prefix solution. iceberg.mr.table.schema in serDeProperties is overlaid from the corresponding tableDesc, so we get the correct table schema when HiveIcebergSerDe is initialized.

pvary · 2021-01-08T10:49:50Z

mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergSerDe.java

+    // When same table is joined multiple times, it is possible some selected columns are duplicated,
+    // in this case wrong recordStructField position leads wrong value or ArrayIndexOutOfBoundException
+    String[] distinctSelectedColumns = Arrays.stream(selectedColumns).distinct().toArray(String[]::new);
+    Schema projectedSchema = distinctSelectedColumns.length > 0 ?
+            tableSchema.select(distinctSelectedColumns) : tableSchema;
+    // the input split mapper handles does not belong to this table
+    // it is necessary to ensure projectedSchema equals to tableSchema,
+    // or we cannot find selectOperator's column from inspector
+    if (projectedSchema.columns().size() != distinctSelectedColumns.length) {
+      projectedSchema = tableSchema;
+    }


@marton-bod: Could you please take a look? You know more about the schema projection.
Thanks,
Peter

Looks good generally, but I wanted to clarify this comment:
// the input split mapper handles does not belong to this table // it is necessary to ensure projectedSchema equals to tableSchema, // or we cannot find selectOperator's column from inspector
Just for my understanding, can you give an example in what scenario we could face this issue where the Schema.select() gives back a different number of columns?

Take the test case testSelectedColumnsOverlapJoin and assume the mapper is handling a split that belongs to table default.orders. The columns set in hive.io.file.readcolumn.names are [order_id, customer_id], so the inspector created for table default.customers contains only the column [customer_id]. When the selectOperator for default.customers is initialized, the field first_name cannot be found in the inspector we just created, so the exception below is thrown:

cannot find field first_name from [org.apache.iceberg.mr.hive.serde.objectinspector.IcebergRecordObjectInspector$IcebergRecordStructField@a6807525] at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef

The cause of this exception is that the schema returned by Schema.select() is not what we want. Returning an inspector that contains all table columns is an easy workaround for this issue.
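A simplified sketch of that fallback, using lists of column names instead of Iceberg Schema objects. The select helper is a hypothetical stand-in that keeps only the requested names that exist in the table, and the customers column list used in main is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ProjectionFallbackSketch {
  // Hypothetical stand-in for Schema.select(): keep table columns whose names were requested.
  static List<String> select(List<String> tableColumns, Set<String> requested) {
    List<String> result = new ArrayList<>();
    for (String column : tableColumns) {
      if (requested.contains(column)) {
        result.add(column);
      }
    }
    return result;
  }

  // Mirrors the check in the patch: if the projection does not cover every requested
  // column, the requested columns belong to another table, so use the full table schema.
  static List<String> projectedColumns(List<String> tableColumns, String[] selectedColumns) {
    Set<String> distinct = new LinkedHashSet<>(Arrays.asList(selectedColumns));
    List<String> projected = distinct.isEmpty() ? tableColumns : select(tableColumns, distinct);
    if (projected.size() != distinct.size()) {
      return tableColumns;
    }
    return projected;
  }

  public static void main(String[] args) {
    // Assumed column list for default.customers.
    List<String> customers = Arrays.asList("customer_id", "first_name", "last_name");
    // The mapper is reading an orders split, so the requested columns are those of orders.
    System.out.println(projectedColumns(customers, new String[] {"order_id", "customer_id"}));
  }
}
```

With the orders columns requested, only customer_id matches, the sizes differ, and the full customers column list is returned, so first_name can be resolved by the inspector.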

Thanks for the explanation!

pvary · 2021-01-08T10:50:42Z

mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandlerTestUtils.java


    return testTableType.instance(shell.metastore().hiveConf(), temp);
  }

  static void init(TestHiveShell shell, TestTables testTables, TemporaryFolder temp, String engine) {
+    init(shell, testTables, temp, engine, "false");


Could we use a boolean instead of the string?

pvary · 2021-01-08T10:51:17Z

mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java

 import static org.apache.iceberg.types.Types.NestedField.required;
 import static org.junit.runners.Parameterized.Parameter;
 import static org.junit.runners.Parameterized.Parameters;

 @RunWith(Parameterized.class)
 public class TestHiveIcebergStorageHandlerWithEngine {

-  private static final String[] EXECUTION_ENGINES = new String[] {"tez", "mr"};
+  private static final String[] EXECUTION_ENGINES = new String[]{"tez", "mr"};


nit: I usually try to avoid formatting only changes

Sorry, that's my fault: IDEA reformatted the code automatically. I will pay more attention next time.

pvary · 2021-01-08T10:53:01Z

mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java

-  private static final String[] EXECUTION_ENGINES = new String[] {"tez", "mr"};
+  private static final String[] EXECUTION_ENGINES = new String[]{"tez", "mr"};
+
+  private static final String[] CBO_ENABLES = new String[]{"true", "false"};


We might not need this list, as the values cannot be extended even in the long run. Could we just add them by hand in the parameters() method?

Okay, I will move the CBO_ENABLES list into the parameters() method.

pvary · 2021-01-08T10:53:44Z

mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java

@@ -107,6 +126,9 @@
  @Parameter(2)
  public TestTables.TestTableType testTableType;

+  @Parameter(3)
+  public String cboEnable;


Boolean please

pvary · 2021-01-08T10:54:53Z

mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java

+            shell.executeStatement("SELECT first_name, customer_id FROM default.customers ORDER BY customer_id DESC");

    Assert.assertEquals(3, descRows.size());
-    Assert.assertArrayEquals(new Object[] {"Trudy", 2L}, descRows.get(0));
-    Assert.assertArrayEquals(new Object[] {"Bob", 1L}, descRows.get(1));
-    Assert.assertArrayEquals(new Object[] {"Alice", 0L}, descRows.get(2));
+    Assert.assertArrayEquals(new Object[]{"Trudy", 2L}, descRows.get(0));
+    Assert.assertArrayEquals(new Object[]{"Bob", 1L}, descRows.get(1));
+    Assert.assertArrayEquals(new Object[]{"Alice", 0L}, descRows.get(2));
+  }


Am I right that these are formatting only changes?
It is much easier if we do not have them, so I usually try to avoid them in my PRs

pvary · 2021-01-08T10:56:25Z

mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java

+            "SELECT o1.order_id, o1.customer_id, o1.total " +
+                    "FROM default.orders o1 JOIN default.orders o2 ON o1.order_id = o2.order_id ORDER BY o1.order_id"


Why was this change needed?
Could you please help?
Thanks,
Peter

As shown in https://github.com/apache/hive/blob/113f6af7528f016bf821f7a746bad496cc93f834/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L992-L998,
a self join can lead to duplicated columns in the jobConf property hive.io.file.readcolumn.names.
In this case, it is necessary to test whether a self join works correctly.
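To illustrate the duplication, here is a small sketch that de-duplicates a comma-separated column list while keeping the first-seen order. The input string is an assumed example of what a self join might leave in hive.io.file.readcolumn.names, not a value captured from a real job, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class SelfJoinColumnsSketch {
  // Split a comma-separated column list and drop repeated names, keeping first-seen order.
  static String[] distinctColumns(String readColumnNames) {
    if (readColumnNames.isEmpty()) {
      return new String[0];
    }
    return Arrays.stream(readColumnNames.split(",")).distinct().toArray(String[]::new);
  }

  public static void main(String[] args) {
    // o1 contributes order_id, customer_id, total; o2 contributes order_id again.
    String fromSelfJoin = "order_id,customer_id,total,order_id";
    System.out.println(Arrays.toString(distinctColumns(fromSelfJoin)));
  }
}
```

Without the distinct step, four names would be requested but only three columns projected, which is the kind of mismatch that can shift field positions.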

Oh.. I have missed that the FROM part is from the orders table.
Sorry

pvary · 2021-01-08T11:01:52Z

Thanks for the patch @qphien!
Really appreciate that you have taken the time to track this down!

Asked a few questions in the review comments. The general observations:

Please do not make formatting-only changes. They make the review (and backport) harder.
I would recommend using a boolean or Boolean instead of "false"/"true" strings.
There is one particular line I personally removed once accidentally and later realized that it is needed for performant queries (we might want to add a test case for it 😄). I think the line is still needed.
I asked @marton-bod to review the projection related part since he was the one working on that.

Thanks,
Peter

qphien · 2021-01-09T04:16:18Z

I'm not sure whether these test failures are related to this PR. I ran the tests locally and all of them passed.

org.apache.iceberg.spark.extensions.TestCopyOnWriteDelete > testDeleteWithSerializableIsolation[catalogName = spark_catalog, implementation = org.apache.iceberg.spark.SparkSessionCatalog, config = {type=hive, default-namespace=default, clients=1, parquet-enabled=false, cache-enabled=false}, format = avro, vectorized = false] FAILED
    java.lang.RuntimeException: Failed to get table info from metastore default.table

        Caused by:
        org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed)

            Caused by:
            java.net.SocketException: Broken pipe (Write failed)

marton-bod · 2021-01-12T11:33:02Z

mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java

  public static Collection<Object[]> parameters() {
    Collection<Object[]> testParams = new ArrayList<>();
    String javaVersion = System.getProperty("java.specification.version");
+    List<Boolean> cboEnables = ImmutableList.of(true, false);


Instead of running all tests for both CBO on/off, can we turn it on just in those one or two unit test cases where we want to test it, by setting it via shell.setHiveSessionValue() at the beginning of the test? We're trying to prevent the number of test runs from exploding via the combination of ever-increasing test parameters.
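The per-test approach could look roughly like the sketch below. RecordingShell is a hypothetical stand-in for the test shell that only records session values, and the exact signature of setHiveSessionValue is assumed here; "hive.cbo.enable" is the Hive property that switches CBO on or off.

```java
import java.util.HashMap;
import java.util.Map;

class PerTestCboSketch {
  // Hypothetical stand-in for the test shell: it only records session values.
  static class RecordingShell {
    final Map<String, String> sessionValues = new HashMap<>();

    void setHiveSessionValue(String key, boolean value) {
      sessionValues.put(key, String.valueOf(value));
    }
  }

  // Shape of a join test that opts into CBO itself instead of through a test parameter.
  static void joinTestWithCbo(RecordingShell shell) {
    shell.setHiveSessionValue("hive.cbo.enable", true);
    // ... the join query and its assertions would follow here
  }

  public static void main(String[] args) {
    RecordingShell shell = new RecordingShell();
    joinTestWithCbo(shell);
    System.out.println(shell.sessionValues);
  }
}
```

Only the tests that call the setter run with CBO enabled, so the parameter matrix stays the same size.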

@marton-bod OK, I have moved the CBO test into some of the join test cases.

marton-bod

LGTM, thanks for picking this up @qphien

rdblue · 2021-01-13T22:19:18Z

Thanks for reviewing, @marton-bod and @pvary. Have all your concerns been addressed?

marton-bod · 2021-01-14T11:38:16Z

@rdblue yes, it should be good to go from my side. Thanks!

pvary · 2021-01-14T14:21:20Z

+1 from my side too

rdblue · 2021-01-15T17:27:50Z

Thanks for fixing this, @qphien! And thanks for reviewing, @marton-bod and @pvary!

Co-authored-by: 罗冲 <luochong@corp.netease.com>

Fix join issues in hive when CBO is enabled

db6b410

github-actions bot added the MR label Jan 8, 2021

pvary reviewed Jan 8, 2021

Reformat code and remove unnecessary static list

fb44496

marton-bod reviewed Jan 12, 2021

Running CBO test only in some join test cases

a174634

marton-bod approved these changes Jan 13, 2021

pvary approved these changes Jan 15, 2021

rdblue merged commit d50f540 into apache:master Jan 15, 2021

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021

Hive: Fix join issues when CBO is enabled (apache#2052)

c0d979b

Co-authored-by: 罗冲 <luochong@corp.netease.com>

openinx mentioned this pull request Oct 27, 2021

Hive: Bug when runing SQL with multiple table join #3393

Closed

pvary mentioned this pull request Nov 2, 2021

Hive: Bug when runing SQL with multiple table join. #3392

Closed
