Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC, Hive: Fix column projection when reading ORC files and orc.force.positional.evolution is set to true on the default configuration #2111

Merged
merged 2 commits into from
Jan 18, 2021

Conversation

pvary
Copy link
Contributor

@pvary pvary commented Jan 18, 2021

After this change (HIVE-20129 Revert to position based schema evolution for orc tables) in Hive the Iceberg ORC column projection is not working as expected.

The effect:

  • If the table has 3 columns (customer_id, first_name, last_name)
  • And the query requests the last 2 columns (first_name, last_name)
  • We will read the first 2 (customer_id, first_name) instead of the requested ones

I think it would be good to turn off this config for our Iceberg ORC readers as we expect it to be false anyway and other systems using Iceberg might need it to be set differently.

The patch contains 3 groups of changes:

  • Test infra changes so we can change HMS configuration (this is used for HiveCatalog FileIO creation)
  • Actually setting the orc.force.positional.evolution to true in the tests - this causes TestHiveIcebergStorageHandlerLocalScan.testColumnSelection() to fail
  • Setting orc.force.positional.evolution to false in ORC.ReadBuilder to override the configuration values to the expected one - tis fixes the test failures

…e.positional.evolution` is set to `true` on the default configuration
@rdblue
Copy link
Contributor

rdblue commented Jan 18, 2021

If I understand correctly, the id-based projection requests fields by name and if the projection type is changed in the Hive conf, that causes the wrong projection. Right? If so, then I agree with the fix to set this in the conf that we use for readers.

@@ -106,7 +115,9 @@ public void start(int poolSize) {

TServerSocket socket = new TServerSocket(0);
int port = socket.getServerSocket().getLocalPort();
this.hiveConf = newHiveConf(port);
initConf(conf, port);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: if initConf returned the configuration, this could be one line still: this.hiveConf = initConf(conf, port);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the change the configuration was created in newHiveConf(). The new method changes values in the conf parameter. The change meant to emphasize that we are not creating a new conf, only changing values in the one got in the parameter list.

Shall we move back to create a new config and initialize that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If conf is passed in, I think the change to initConf makes sense. I just think the patch could be slightly smaller. Not a blocker, though.

@rdblue rdblue merged commit e0de218 into apache:master Jan 18, 2021
@pvary pvary deleted the orcfix branch January 19, 2021 08:57
XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants