-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: add JSON parser for ContentFile and FileScanTask #6934
Conversation
/** | ||
* Return the schema for this file scan task. | ||
*/ | ||
default Schema schema() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is needed so that FileScanTaskParser
(added in this PR) can serialize the schema. Then during the deserialization part, schema can be pass into the constructor of BaseFileScanTask
.
Keep it at this level (not base ContentScanTask
interface or lower) to limit the scope of change.
@@ -52,12 +53,24 @@ public F file() { | |||
return file; | |||
} | |||
|
|||
protected Schema schema() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exposed as protected
so that BaseFileScanTask
can use it to implement the FileScanTask#schema()
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Little odd that we reverse engineer the schema from the string here, but seems like the most backwards compatible thing we can do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree it is a little odd. On the other hand, partition spec is in the same model in this class. As you said, otherwise we would have to change the constructors of a bunch of classes. The current choice of passing schema and spec as strings is to make those scan tasks serializable.
@Override
public PartitionSpec spec() {
if (spec == null) {
synchronized (this) {
if (spec == null) {
this.spec = PartitionSpecParser.fromJson(schema(), specString);
}
}
}
return spec;
}
cc @nastra
import org.apache.iceberg.util.ArrayUtil; | ||
import org.apache.iceberg.util.JsonUtil; | ||
|
||
class ContentFileParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since DataFile and DeleteFile has the same structure, calling this ContentFileParser
without any generic type.
@@ -134,6 +134,8 @@ public static class Builder { | |||
private Map<Integer, ByteBuffer> upperBounds = null; | |||
private ByteBuffer keyMetadata = null; | |||
private List<Long> splitOffsets = null; | |||
private List<Integer> equalityFieldIds = null; | |||
private Integer sortOrderId = SortOrder.unsorted().orderId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relocated the line here to follow the same order of definition
@@ -134,6 +134,8 @@ public static class Builder { | |||
private Map<Integer, ByteBuffer> upperBounds = null; | |||
private ByteBuffer keyMetadata = null; | |||
private List<Long> splitOffsets = null; | |||
private List<Integer> equalityFieldIds = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a setter for equalityFieldIds
so that the parser unit test can cover this field too.
|
||
private final PartitionSpec spec; | ||
|
||
ContentFileParser(PartitionSpec spec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike other JSON parser with a static singleton pattern, ContentFileParser
depends on the partition spec. Hence this is a regular class and constructor.
af84243
to
f271871
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a high-level pass over the parsers themselves and left a few comments. I haven't had a chance to look closer at the tests yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nastra thx a lot for the initial review. I addressed the comments in the latest commit
core/src/test/java/org/apache/iceberg/TestFileScanTaskParser.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestFileScanTaskParser.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestContentFileParser.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestContentFileParser.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestContentFileParser.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestFileScanTaskParser.java
Outdated
Show resolved
Hide resolved
a8062a7
to
4d57100
Compare
78fec72
to
92a162f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the late re-review @stevenzwu, I've left a few more comments.
core/src/test/java/org/apache/iceberg/TestFileScanTaskParser.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been mainly focusing on the JSON parsers and left a few comments, but overall this looks almost ready. It would be great to get some additional input from another reviewer
import org.junit.jupiter.params.provider.Arguments; | ||
import org.junit.jupiter.params.provider.MethodSource; | ||
|
||
public class TestContentFileParser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to also add a test with a plain JSON string to see how the full JSON looks like. And then maybe also another test with a plain JSON string where all optional fields (metrics, equality field ids, sort order id, split offsets, ...) are missing
4242a0b
to
00bf6c0
Compare
|
||
JsonNode pNode = node.get(property); | ||
Preconditions.checkArgument( | ||
pNode.isTextual(), "Cannot parse from non-text value: %s: %s", property, pNode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe we should mention that we're trying to parse this from text to a binary representation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also fixed a couple other error msgs with the same problem.
Spark CI build failed with some seemingly env problem
|
61e40a7
to
0016f36
Compare
…pected json string
a465d34
to
8105811
Compare
merging after rebase |
this closes issue #1698.
There are two motivations as described by issue #1698.