
core: add JSON parser for ContentFile and FileScanTask #6934

Merged: 12 commits into apache:master on Jun 26, 2023

Conversation

@stevenzwu (Contributor, Author) commented Feb 24, 2023

This closes issue #1698.

There are two motivations, as described in issue #1698 (a usage sketch follows the list):

  1. Provide a more stable serialization (than Java serialization) for Flink checkpoints.
  2. Allow the REST catalog to use it for scan planning or committing files.
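
A minimal sketch of the intended round trip, e.g. inside a Flink checkpoint serializer. It assumes FileScanTaskParser exposes static toJson/fromJson helpers; the case-sensitivity flag on fromJson is an assumption about the signature, not a confirmed API:

  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.FileScanTaskParser;

  class FileScanTaskJsonRoundTrip {
    // Serialize a scan task to JSON for a checkpoint, then restore it.
    static FileScanTask roundTrip(FileScanTask task) {
      String json = FileScanTaskParser.toJson(task);
      return FileScanTaskParser.fromJson(json, /* caseSensitive= */ true);
    }
  }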

/**
* Return the schema for this file scan task.
*/
default Schema schema() {
@stevenzwu (Contributor, Author), Feb 24, 2023:

This is needed so that FileScanTaskParser (added in this PR) can serialize the schema. During deserialization, the schema can then be passed into the constructor of BaseFileScanTask.

Keeping it at this level (not the base ContentScanTask interface or lower) limits the scope of the change.
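
For reference, a sketch of how BaseFileScanTask could implement the new method, mirroring the lazy spec() pattern quoted further down in this thread; the schemaString field name is an assumption:

  @Override
  public Schema schema() {
    // Sketch only: lazily parse the serialized schema string on first access.
    if (schema == null) {
      synchronized (this) {
        if (schema == null) {
          this.schema = SchemaParser.fromJson(schemaString);
        }
      }
    }
    return schema;
  }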

@@ -52,12 +53,24 @@ public F file() {
return file;
}

protected Schema schema() {
@stevenzwu (Contributor, Author):

Exposed as protected so that BaseFileScanTask can use it to implement the FileScanTask#schema() method.

Member:

It's a little odd that we reverse-engineer the schema from the string here, but this seems like the most backwards-compatible thing we can do.

@stevenzwu (Contributor, Author):

Agreed, it is a little odd. On the other hand, the partition spec follows the same model in this class. As you said, otherwise we would have to change the constructors of a bunch of classes. The current choice of passing the schema and spec as strings is what makes those scan tasks serializable.

  @Override
  public PartitionSpec spec() {
    if (spec == null) {
      synchronized (this) {
        if (spec == null) {
          this.spec = PartitionSpecParser.fromJson(schema(), specString);
        }
      }
    }
    return spec;
  }

cc @nastra

import org.apache.iceberg.util.ArrayUtil;
import org.apache.iceberg.util.JsonUtil;

class ContentFileParser {
@stevenzwu (Contributor, Author):

Since DataFile and DeleteFile have the same structure, this is named ContentFileParser, without any generic type.
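
A sketch of why no generic type is needed: both file kinds expose the same accessors, and the content type is written out explicitly. The property names below are illustrative assumptions, not the confirmed wire format:

  import com.fasterxml.jackson.core.JsonGenerator;
  import java.io.IOException;

  // Hypothetical writer fragment; DataFile and DeleteFile share these accessors.
  void toJson(ContentFile<?> contentFile, JsonGenerator generator) throws IOException {
    generator.writeStartObject();
    generator.writeStringField("content", contentFile.content().name());
    generator.writeStringField("file-path", contentFile.path().toString());
    generator.writeStringField("file-format", contentFile.format().name());
    generator.writeNumberField("record-count", contentFile.recordCount());
    generator.writeNumberField("file-size-in-bytes", contentFile.fileSizeInBytes());
    generator.writeEndObject();
  }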

@@ -134,6 +134,8 @@ public static class Builder {
private Map<Integer, ByteBuffer> upperBounds = null;
private ByteBuffer keyMetadata = null;
private List<Long> splitOffsets = null;
private List<Integer> equalityFieldIds = null;
private Integer sortOrderId = SortOrder.unsorted().orderId();
@stevenzwu (Contributor, Author):

Relocated this line here to keep the same order as the field definitions.

@@ -134,6 +134,8 @@ public static class Builder {
private Map<Integer, ByteBuffer> upperBounds = null;
private ByteBuffer keyMetadata = null;
private List<Long> splitOffsets = null;
private List<Integer> equalityFieldIds = null;
@stevenzwu (Contributor, Author):

Added a setter for equalityFieldIds so that the parser unit test can cover this field too.
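
A hypothetical sketch of that setter, following the builder's existing with-prefixed style (the exact name in the PR may differ):

  public Builder withEqualityFieldIds(List<Integer> fieldIds) {
    this.equalityFieldIds = fieldIds;
    return this;
  }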


private final PartitionSpec spec;

ContentFileParser(PartitionSpec spec) {
@stevenzwu (Contributor, Author):

Unlike other JSON parsers, which follow a static singleton pattern, ContentFileParser depends on the partition spec. Hence this is a regular class with a constructor.

@nastra (Contributor) left a comment:

Did a high-level pass over the parsers themselves and left a few comments. I haven't had a chance to look closer at the tests yet.

@stevenzwu (Contributor, Author) left a comment:

@nastra thanks a lot for the initial review. I addressed the comments in the latest commit.

@stevenzwu force-pushed the issue-1698-split-json branch from a8062a7 to 4d57100 on April 5, 2023 02:41
@nastra (Contributor) left a comment:

Sorry for the late re-review @stevenzwu; I've left a few more comments.

@nastra (Contributor) left a comment:

I've been mainly focusing on the JSON parsers and left a few comments, but overall this looks almost ready. It would be great to get some additional input from another reviewer.

import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

public class TestContentFileParser {
Contributor:

I think it would be good to also add a test with a plain JSON string, to see what the full JSON looks like. And then maybe another test with a plain JSON string where all optional fields (metrics, equality field ids, sort order id, split offsets, ...) are missing.
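
A round-trip test along those lines might look like the following sketch; the parser entry points (constructor plus toJson/fromJson) are assumptions based on this thread, not the final API:

  @Test
  public void testDataFileJsonRoundTrip() {
    PartitionSpec spec = PartitionSpec.unpartitioned();
    DataFile dataFile =
        DataFiles.builder(spec)
            .withPath("/path/to/data-1.parquet")
            .withFileSizeInBytes(10)
            .withRecordCount(1)
            .build();

    ContentFileParser parser = new ContentFileParser(spec);
    String json = parser.toJson(dataFile);
    // Parsing the JSON back should reproduce the original file metadata.
    DataFile roundTrip = (DataFile) parser.fromJson(json);
    assertThat(roundTrip.path().toString()).isEqualTo(dataFile.path().toString());
  }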


JsonNode pNode = node.get(property);
Preconditions.checkArgument(
pNode.isTextual(), "Cannot parse from non-text value: %s: %s", property, pNode);
Contributor:

Nit: maybe we should mention that we're trying to parse this from text to a binary representation.
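
For instance, the check could name the target type explicitly; a sketch of the reworded message (wording illustrative):

  JsonNode pNode = node.get(property);
  Preconditions.checkArgument(
      pNode != null && pNode.isTextual(),
      "Cannot parse binary value from non-text value: %s: %s", property, pNode);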

@stevenzwu (Contributor, Author):

I also fixed a couple of other error messages with the same problem.

@stevenzwu (Contributor, Author):

The Spark CI build failed with what looks like an environment problem:

        Caused by:
        java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxr-xr-x
            at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:724)
            at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:654)
            at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:586)
            at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:548)
            at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:174)
            at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:129)
            at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
            at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:293)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:492)
            at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:352)
            at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:71)
            at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:70)
            at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:224)
            at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
            at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)

@stevenzwu force-pushed the issue-1698-split-json branch from 61e40a7 to 0016f36 on June 24, 2023 03:01
@stevenzwu force-pushed the issue-1698-split-json branch from a465d34 to 8105811 on June 25, 2023 14:35
@stevenzwu (Contributor, Author):

Merging after rebase.

@stevenzwu merged commit b8db3f0 into apache:master on Jun 26, 2023