Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Use avro compression properties from table properties while writing Manifest and Manifest list files. #5893

Closed

Conversation

sumeetgajjar
Copy link
Contributor

This PR fixes #5892.

In this change, we add the table properties parameter to ManifestWriter. ManifestWriter further passes this properties map to the newAppender method, which uses the properties while creating AvroFileAppender.

Testing done:

  • Added a UT to validate the Avro compression properties are respected while writing the manifest and manifest list files.
  • Manually verified compression codec of manifest and manifest list files of an Iceberg Table (created using Spark) using avro-tools

@sumeetgajjar
Copy link
Contributor Author

* @return a manifest writer
*/
public static ManifestWriter<DataFile> write(
PartitionSpec spec, OutputFile outputFile, Map<String, String> config) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods should always include the format version.

Also, I don't think it is a good idea to add a property map throughout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The write method without format version is already referenced in the code base prior to my changes.

public static ManifestWriter<DataFile> write(PartitionSpec spec, OutputFile outputFile) {

Thus, I copied the signature of the above write method and added compressionCodec and compressionLevel as additional parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, if the caller decides to use the AVRO_COMPRESSION_LEVEL_DEFAULT value then using int as the datatype would not be possible.

public static final String AVRO_COMPRESSION_LEVEL_DEFAULT = null;

@rdblue
Copy link
Contributor

rdblue commented Oct 4, 2022

@sumeetgajjar, I think it's a good idea to be able to set the compression codec, but we don't want to pass a map of properties around. That's exposing too much where it doesn't need to be, and people tend to misuse generic arguments like this. Let's pass the compression code explicitly.

@sumeetgajjar
Copy link
Contributor Author

@sumeetgajjar, I think it's a good idea to be able to set the compression codec, but we don't want to pass a map of properties around. That's exposing too much where it doesn't need to be, and people tend to misuse generic arguments like this. Let's pass the compression code explicitly.

@rdblue thank you for the comment, incorporated the suggestions in the latest commit.

@sumeetgajjar sumeetgajjar requested a review from rdblue October 5, 2022 15:39
* @param compressionLevel compression level of the compressionCodec
* @return a manifest writer
*/
public static ManifestWriter<DataFile> write(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumeetgajjar, this must explicitly set the format version. This is a new method, so the format version should always be included. Older methods did not use a format version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding formatVersion here would make its signature the same as the other newly added write method. Thus removing it altogether.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the method altogether due to the reason stated above.

* @return a manifest writer
*/
public static ManifestWriter<DataFile> write(
PartitionSpec spec, OutputFile outputFile, String compressionCodec, String compressionLevel) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compression level is an int, not a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll refactor it to Integer.
The reason I suggest using Integer and not the primitive int type is to facilitate cases when the codec is set to snappy and uncompressed. Since both of these codecs, no compression level is required, null can be passed here.

I would like to hear your thoughts on this before I go ahead and make the change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and used Integer for compressionLevel.
This is to ensure we can pass null values for Codecs that do not require it.
Had we used int, then we would not be able to represent null.

table.spec(),
manifestLocation,
table.properties().get(TableProperties.AVRO_COMPRESSION),
table.properties().get(TableProperties.AVRO_COMPRESSION_LEVEL));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is worth updating places like this. It's just needless code churn when we have no reason to think that the codec would be set or matter to the test. Can you remove cases like this from the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted and tried reducing the overall churn.

parentId,
0,
/* compressionCodec */ null,
/* compressionLevel */ null)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not worth updating this, can you revert it?

Copy link
Contributor Author

@sumeetgajjar sumeetgajjar Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted

@@ -22,6 +22,8 @@

import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.avro.AvroIterable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the updates here? I can understand testing that the manifest is written with a certain codec, but that should be in TestManifestWriter right? This tests different versions, which doesn't have much to do with the compression codec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll move the ManifestWriter tests to TestManifestWriter.

For ManifestListWriter there does not exist a TestManifestListWriter class, I'll create the new test class and move the ManifestListerWriter tests to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @sumeetgajjar. This is looking better, but it has a lot of unnecessary changes just to call the new method with compression settings that don't matter. Can you make this smaller by removing the parts that don't need to be changed?

1. remove unecessary edits to use compression codec and
   compression level from tests
2. move ManifestWriter tests to TestManifestWriter
3. move ManifestListWriter tests to TestManifestListWriter
4. remove unwanted `ManifestLists#write` method
@sumeetgajjar
Copy link
Contributor Author

Thanks, @sumeetgajjar. This is looking better, but it has a lot of unnecessary changes just to call the new method with compression settings that don't matter. Can you make this smaller by removing the parts that don't need to be changed?

Thanks @rdblue for the feedback, tried avoiding unnecessary changes and made the change comparatively smaller.

@sumeetgajjar sumeetgajjar requested a review from rdblue October 7, 2022 07:05
@sumeetgajjar
Copy link
Contributor Author

@rdblue I have addressed your comments, can you please review the PR once you have time?

Comment on lines +240 to +245
if (compressionCodec != null) {
builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec);
}
if (compressionLevel != null) {
builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: new line between the if blocks and the next statement

Comment on lines +286 to +291
if (compressionCodec != null) {
builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec);
}
if (compressionLevel != null) {
builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same new line

Comment on lines +170 to +175
if (compressionCodec != null) {
builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec);
}
if (compressionLevel != null) {
builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: newline between blocks and the next statement

    if (compressionCodec != null) {
      builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec);
    }
    
    if (compressionLevel != null) {
      builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString());
    }
    
    return builder.build();

Comment on lines +21 to +34
public class NumberUtil {

private NumberUtil() {}

/**
* @param value the string to convert, can be null
* @return parsed integer, returns null if the string is null
*/
public static Integer createInteger(String value) {
if (value == null) {
return null;
}
return Integer.parseInt(value);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need this or is there another util where the function can go? At minimum IMO shouldn't be public

* @param value the string to convert, can be null
* @return parsed integer, returns null if the string is null
*/
public static Integer createInteger(String value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this class. We are reading current.properties().get(TableProperties.AVRO_COMPRESSION_LEVEL) as a String in order to convert it to an Integer, just to convert it later back to a String

Copy link
Contributor

@wypoon wypoon Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this class. Instead of NumberUtil.createInteger we can use PropertyUtil.propertyAsNullableInt on the properties Map and property name, and that gets the value of the property as an Integer.

manifestFile,
snapshotId,
compressionCodec,
/* compressionLevel */ null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just pass AVRO_COMPRESSION_LEVEL_DEFAULT rather than null?

@@ -237,12 +253,23 @@ ManifestFile writeManifest(DataFile... files) throws IOException {
}

ManifestFile writeManifest(Long snapshotId, DataFile... files) throws IOException {
File manifestFile = temp.newFile("input.m0.avro");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to maintain/keep that input.m0.avro file name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this is needed, because we're determining the file format from the file ending

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do need the manifest file to end in ".avro", but it's fine to call temp.newFile() (which uses a random name) and add the extension afterwards.


static final long SNAPSHOT_ID = 987134631982734L;

private static final long SEQUENCE_NUMBER = 34L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this constant? It seems to be used only in a single place

@@ -163,6 +166,19 @@ public class TableTestBase {

static final FileIO FILE_IO = new TestTables.LocalFileIO();

static final Map<String, String> CODEC_METADATA_MAPPING =
ImmutableMap.<String, String>builder()
.put("uncompressed", "null")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the Avro.Codec enum here rather than plain strings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see much benefit to using Avro.Codec. You'd have to get the name from the enum and then lowercase it. I think in this case, plain strings are fine.

import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestManifestListWriter extends TableTestBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding these tests to TestManifestListVersions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think TestManifestListVersions is intended for tests of versions, similar to TestManifestWriterVersions. In fact, @rdblue suggested that the tests @sumeetgajjar added for manifests be moved from TestManifestWriterVersions to TestManifestWriter for that reason. There is no existing TestManifestListWriter, so Sumeet created one.

@@ -286,6 +286,8 @@ project(':iceberg-core') {
testImplementation "org.xerial:sqlite-jdbc"
testImplementation project(path: ':iceberg-api', configuration: 'testArtifacts')
testImplementation "com.esotericsoftware:kryo"
testImplementation "org.xerial.snappy:snappy-java"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used anywhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems not and can be removed.

@sumeetgajjar
Copy link
Contributor Author

sumeetgajjar commented Jan 9, 2023

Thanks for the reviews @rdblue @amogh-jahagirdar @nastra, really appreciate it.

There was an internal redistribution of tasks and a change of ownership of some GitHub PRs due to internal changes.
I am closing this PR and moving forward @wypoon or @sririshindra would be leading the changes suggested in this PR.
They would pull the public branch associated with this PR, add their commits on top of it and create a new PR to ease the logistics of the transition.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Iceberg does not respect the avro properties from TBLPROPERTIES while writing Manifest files
5 participants