Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC: Add compression properties #4273

Merged
merged 5 commits into from
Mar 30, 2022

Conversation

zhongyujiang
Copy link
Contributor

This PR adds compression-codec and compression-strategy of ORC in TableProperties, @openinx @hililiwei would you help review this? Thanks!


return new Context(stripeSize, blockSize);
static String lookupValue(Configuration conf, OrcConf orcConf, String icebergPropertyKey) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is wrong if this is looking up an Iceberg table property like ORC_STRIPE_SIZE_BYTES in a Hadoop Configuration. Iceberg settings should be pulled from table properties, not from Hadoop.

@rdblue
Copy link
Contributor

rdblue commented Mar 7, 2022

I don't think that we can merge this until we fix the ORC configuration PR that recently went in. See #3810 (comment) for more details. The configuration context should not use Hadoop Configuration.

@zhongyujiang zhongyujiang force-pushed the orc-compression-props branch from a93bda2 to d54c377 Compare March 28, 2022 11:55
@zhongyujiang
Copy link
Contributor Author

Updated based on #4291.

@zhongyujiang zhongyujiang requested a review from rdblue March 28, 2022 12:27
@github-actions github-actions bot added the docs label Mar 28, 2022
Comment on lines 262 to 267
String codecAsString = config.getOrDefault(OrcConf.COMPRESS.getAttribute(), ORC_COMPRESSION_DEFAULT);
codecAsString = config.getOrDefault(ORC_COMPRESSION, codecAsString);

String compressionStrategy = config.getOrDefault(OrcConf.COMPRESSION_STRATEGY.getAttribute(),
ORC_COMPRESSION_STRATEGY_DEFAULT);
compressionStrategy = config.getOrDefault(ORC_COMPRESSION_STRATEGY, compressionStrategy);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just follow the same config parse approach as the above config keys did ? I'm saying using the generic PropertyUtil.propertyAsString.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, please also use the CompressionStrategy enum to validate whether this configured value is a valid one ? ( Just as this comment said: https://github.com/apache/iceberg/pull/4273/files#r836994947 )

Preconditions.checkArgument(vectorizedRowBatchSize > 0, "VectorizedRow batch size must be > 0");

return new Context(stripeSize, blockSize, vectorizedRowBatchSize);
String codecAsString = config.getOrDefault(OrcConf.COMPRESS.getAttribute(), ORC_COMPRESSION_DEFAULT);
codecAsString = config.getOrDefault(ORC_COMPRESSION, codecAsString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will suggest to use the similar approach as parquet did: try to parse the string into a CompressionKind so that we can validate whether this value is a valid one in the correct place.

See: https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L392

Comment on lines 282 to 285
String codecAsString = config.getOrDefault(DELETE_ORC_COMPRESSION, dataContext.codecAsString());

String compressionStrategy = config.getOrDefault(DELETE_ORC_COMPRESSION_STRATEGY,
dataContext.compressionStrategy());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please use the same approach to parse the codec & compression-strategy .

Copy link
Member

@openinx openinx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left serval comments, and I think we are recommended to add unit tests as this PR did to avoid regression .

@zhongyujiang
Copy link
Contributor Author

Updated based on comments.

@@ -82,19 +93,29 @@ public void testOrcTableProperties() throws Exception {
DynFields.builder().hiddenImpl(writer.getClass(), "conf").build(writer);

Configuration configuration = confField.get();
Assert.assertEquals(String.valueOf(blockSizeBytes), configuration.get(OrcConf.BLOCK_SIZE.getAttribute()));
Assert.assertEquals(String.valueOf(stripeSizeBytes), configuration.get(OrcConf.STRIPE_SIZE.getAttribute()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to using OrcConf to get value directly so no need to know property key detail.

@@ -53,12 +56,20 @@

@Test
public void testOrcTableProperties() throws Exception {
Long stripeSizeBytes = 32L * 1024 * 1024;
Long blockSizeBytes = 128L * 1024 * 1024;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed Long to long because what OrcConf.XXX#getLong returns is a long, and I got

Ambiguous method call. Both assertEquals(Object, Object) in Assert and assertEquals(long, long) in Assert match

when using both Long and long as parameters for Assert#assertEquals.

@zhongyujiang zhongyujiang requested a review from openinx March 30, 2022 05:48
@openinx openinx merged commit aea336e into apache:master Mar 30, 2022
@zhongyujiang
Copy link
Contributor Author

Thanks all for your review!

@zhongyujiang zhongyujiang deleted the orc-compression-props branch April 1, 2022 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants