-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Use avro compression properties from table properties while writing Manifest and Manifest list files. #5893
Core: Use avro compression properties from table properties while writing Manifest and Manifest list files. #5893
Conversation
* @return a manifest writer | ||
*/ | ||
public static ManifestWriter<DataFile> write( | ||
PartitionSpec spec, OutputFile outputFile, Map<String, String> config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods should always include the format version.
Also, I don't think it is a good idea to add a property map throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The write method without format version is already referenced in the code base prior to my changes.
public static ManifestWriter<DataFile> write(PartitionSpec spec, OutputFile outputFile) { |
Thus, I copied the signature of the above write
method and added compressionCodec
and compressionLevel
as additional parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, if the caller decides to use the AVRO_COMPRESSION_LEVEL_DEFAULT
value then using int
as the datatype would not be possible.
public static final String AVRO_COMPRESSION_LEVEL_DEFAULT = null; |
@sumeetgajjar, I think it's a good idea to be able to set the compression codec, but we don't want to pass a map of properties around. That's exposing too much where it doesn't need to be, and people tend to misuse generic arguments like this. Let's pass the compression code explicitly. |
@rdblue thank you for the comment, incorporated the suggestions in the latest commit. |
* @param compressionLevel compression level of the compressionCodec | ||
* @return a manifest writer | ||
*/ | ||
public static ManifestWriter<DataFile> write( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumeetgajjar, this must explicitly set the format version. This is a new method, so the format version should always be included. Older methods did not use a format version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding formatVersion
here would make its signature the same as the other newly added write
method. Thus removing it altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the method altogether due to the reason stated above.
* @return a manifest writer | ||
*/ | ||
public static ManifestWriter<DataFile> write( | ||
PartitionSpec spec, OutputFile outputFile, String compressionCodec, String compressionLevel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compression level is an int, not a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll refactor it to Integer
.
The reason I suggest using Integer
and not the primitive int type is to facilitate cases when the codec is set to snappy
and uncompressed
. Since both of these codecs, no compression level is required, null can be passed here.
I would like to hear your thoughts on this before I go ahead and make the change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and used Integer
for compressionLevel
.
This is to ensure we can pass null
values for Codecs that do not require it.
Had we used int
, then we would not be able to represent null
.
table.spec(), | ||
manifestLocation, | ||
table.properties().get(TableProperties.AVRO_COMPRESSION), | ||
table.properties().get(TableProperties.AVRO_COMPRESSION_LEVEL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is worth updating places like this. It's just needless code churn when we have no reason to think that the codec would be set or matter to the test. Can you remove cases like this from the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted and tried reducing the overall churn.
parentId, | ||
0, | ||
/* compressionCodec */ null, | ||
/* compressionLevel */ null)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not worth updating this, can you revert it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted
@@ -22,6 +22,8 @@ | |||
|
|||
import java.io.IOException; | |||
import java.util.List; | |||
import java.util.Map; | |||
import org.apache.iceberg.avro.AvroIterable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of the updates here? I can understand testing that the manifest is written with a certain codec, but that should be in TestManifestWriter
right? This tests different versions, which doesn't have much to do with the compression codec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll move the ManifestWriter
tests to TestManifestWriter
.
For ManifestListWriter
there does not exist a TestManifestListWriter
class, I'll create the new test class and move the ManifestListerWriter
tests to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @sumeetgajjar. This is looking better, but it has a lot of unnecessary changes just to call the new method with compression settings that don't matter. Can you make this smaller by removing the parts that don't need to be changed?
1. remove unecessary edits to use compression codec and compression level from tests 2. move ManifestWriter tests to TestManifestWriter 3. move ManifestListWriter tests to TestManifestListWriter 4. remove unwanted `ManifestLists#write` method
Thanks @rdblue for the feedback, tried avoiding unnecessary changes and made the change comparatively smaller. |
@rdblue I have addressed your comments, can you please review the PR once you have time? |
if (compressionCodec != null) { | ||
builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec); | ||
} | ||
if (compressionLevel != null) { | ||
builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: new line between the if blocks and the next statement
if (compressionCodec != null) { | ||
builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec); | ||
} | ||
if (compressionLevel != null) { | ||
builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same new line
if (compressionCodec != null) { | ||
builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec); | ||
} | ||
if (compressionLevel != null) { | ||
builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: newline between blocks and the next statement
if (compressionCodec != null) {
builder.set(TableProperties.AVRO_COMPRESSION, compressionCodec);
}
if (compressionLevel != null) {
builder.set(TableProperties.AVRO_COMPRESSION_LEVEL, compressionLevel.toString());
}
return builder.build();
public class NumberUtil { | ||
|
||
private NumberUtil() {} | ||
|
||
/** | ||
* @param value the string to convert, can be null | ||
* @return parsed integer, returns null if the string is null | ||
*/ | ||
public static Integer createInteger(String value) { | ||
if (value == null) { | ||
return null; | ||
} | ||
return Integer.parseInt(value); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need this or is there another util where the function can go? At minimum IMO shouldn't be public
* @param value the string to convert, can be null | ||
* @return parsed integer, returns null if the string is null | ||
*/ | ||
public static Integer createInteger(String value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this class. We are reading current.properties().get(TableProperties.AVRO_COMPRESSION_LEVEL)
as a String in order to convert it to an Integer, just to convert it later back to a String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this class. Instead of NumberUtil.createInteger
we can use PropertyUtil.propertyAsNullableInt
on the properties Map
and property name, and that gets the value of the property as an Integer.
manifestFile, | ||
snapshotId, | ||
compressionCodec, | ||
/* compressionLevel */ null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just pass AVRO_COMPRESSION_LEVEL_DEFAULT
rather than null?
@@ -237,12 +253,23 @@ ManifestFile writeManifest(DataFile... files) throws IOException { | |||
} | |||
|
|||
ManifestFile writeManifest(Long snapshotId, DataFile... files) throws IOException { | |||
File manifestFile = temp.newFile("input.m0.avro"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to maintain/keep that input.m0.avro
file name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure this is needed, because we're determining the file format from the file ending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do need the manifest file to end in ".avro", but it's fine to call temp.newFile()
(which uses a random name) and add the extension afterwards.
|
||
static final long SNAPSHOT_ID = 987134631982734L; | ||
|
||
private static final long SEQUENCE_NUMBER = 34L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this constant? It seems to be used only in a single place
@@ -163,6 +166,19 @@ public class TableTestBase { | |||
|
|||
static final FileIO FILE_IO = new TestTables.LocalFileIO(); | |||
|
|||
static final Map<String, String> CODEC_METADATA_MAPPING = | |||
ImmutableMap.<String, String>builder() | |||
.put("uncompressed", "null") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use the Avro.Codec
enum here rather than plain strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see much benefit to using Avro.Codec
. You'd have to get the name from the enum and then lowercase it. I think in this case, plain strings are fine.
import org.junit.runners.Parameterized; | ||
|
||
@RunWith(Parameterized.class) | ||
public class TestManifestListWriter extends TableTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about adding these tests to TestManifestListVersions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think TestManifestListVersions
is intended for tests of versions, similar to TestManifestWriterVersions
. In fact, @rdblue suggested that the tests @sumeetgajjar added for manifests be moved from TestManifestWriterVersions
to TestManifestWriter
for that reason. There is no existing TestManifestListWriter
, so Sumeet created one.
@@ -286,6 +286,8 @@ project(':iceberg-core') { | |||
testImplementation "org.xerial:sqlite-jdbc" | |||
testImplementation project(path: ':iceberg-api', configuration: 'testArtifacts') | |||
testImplementation "com.esotericsoftware:kryo" | |||
testImplementation "org.xerial.snappy:snappy-java" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems not and can be removed.
Thanks for the reviews @rdblue @amogh-jahagirdar @nastra, really appreciate it. There was an internal redistribution of tasks and a change of ownership of some GitHub PRs due to internal changes. Thanks. |
This PR fixes #5892.
In this change, we add the table properties parameter to
ManifestWriter
.ManifestWriter
further passes this properties map to thenewAppender
method, which uses the properties while creatingAvroFileAppender
.Testing done:
avro-tools