2680: chore: various refactoring changes for iceberg [iceberg] #14
base: main
Conversation
Walkthrough
Added the parquet-format-structures Maven dependency across pom.xml files. Introduced IcebergCometNativeBatchReader for Iceberg integration and ParquetMetadataSerializer for metadata serialization. Enhanced NativeBatchReader with a FileInfo abstraction and support for preinitialized readers. Added a helper method in AbstractColumnReader. Updated the uuid dependency version in Cargo.toml.
```diff
  try (FileReader fileReader =
      new FileReader(
-         CometInputFile.fromPath(path, conf), footer, readOptions, cometReadOptions, metrics)) {
+         CometInputFile.fromPath(path, conf), readOptions, cometReadOptions, metrics)) {
```
Bug: Optimize FileReader Footer Handling
The FileReader constructor call omits the footer parameter, causing the footer to be re-read from the file even though this.footer is already available. This wastes the pre-loaded footer metadata and defeats optimizations where the footer is passed in separately (like from Iceberg). The constructor should pass footer as the second parameter to use the cached metadata instead of re-reading from disk.
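A minimal sketch of the suggested fix, assuming the five-argument FileReader constructor shown on the removed line above is still available:

```java
// Sketch: pass the already-parsed footer back into FileReader so it is not
// re-read from storage; mirrors the constructor overload on the removed line.
try (FileReader fileReader =
    new FileReader(
        CometInputFile.fromPath(path, conf), footer, readOptions, cometReadOptions, metrics)) {
  // ... existing read logic ...
}
```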
Code Review: Iceberg Native Batch Reader Refactoring

Summary

This PR introduces Iceberg-specific Parquet reading capabilities with serialized metadata support and refactors the core NativeBatchReader to support them.

Code Quality & Best Practices

Strengths

✅ Clean abstraction: the FileInfo abstraction keeps file-location handling separate from the reader logic.

Issues & Suggestions

1. URI Construction in FileInfo Constructor (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:120-122)

```java
URI uri = new Path(filePath).toUri();
if (uri.getScheme() == null) {
  uri = new Path("file://" + filePath).toUri();
}
```

Issue: This logic assumes local files when the scheme is missing, but the original path might be relative or have other intentions.

2. Static Factory Method Naming (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:128)

```java
public static FileInfo fromPartitionedFile(PartitionedFile file)
```

Suggestion: This is a good pattern. Consider adding a corresponding factory for other input types.

3. Timezone Hardcoded to UTC (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:542-543)

```java
// String timeZoneId = conf.get("spark.sql.session.timeZone");
String timeZoneId = "UTC";
```

Issue: The original timezone configuration is commented out and hardcoded to UTC. While the comment at line 543 says "Native code uses UTC always", this removes user flexibility.

4. Exception Handling in IcebergCometNativeBatchReader.init() (common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java:56)

```java
throws Throwable
```

Issue: Using Throwable in the throws clause is overly broad and makes targeted error handling by callers difficult.

5. Metrics Clearing (common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java:73)

```java
this.metrics.clear();
if (metrics != null) {
  this.metrics.putAll(metrics);
}
```

Issue: Clearing metrics before adding new ones could lose existing metrics if the init method is called multiple times.
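If the intent is to accumulate rather than reset, a minimal sketch of the alternative (assuming this.metrics is a plain mutable Map and that keeping earlier entries is actually desired, which the PR does not state) would be to drop the clear:

```java
// Sketch only: merge the incoming metrics instead of wiping the map first,
// so entries registered by an earlier init() call are preserved.
if (metrics != null) {
  this.metrics.putAll(metrics); // overwrites only the keys that are explicitly provided
}
```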
Potential Bugs & Issues

Critical Issues

1. Missing Null Check for dataType (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:999-1011)

```java
DataType dataType = null;
// ... loop to find matching field ...
if (dataType == null) {
  throw new IOException(
      "Could not find matching Spark field for Parquet field: " + field.getName());
}
```

Issue: Good null check, but this could fail if sparkFields is null or empty when the loop runs.

2. Array Index Bounds (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:456)

```java
if (preInitializedReaders != null && preInitializedReaders[i] != null) {
```

Issue: No bounds checking before accessing preInitializedReaders[i].

3. Potential IndexOutOfBounds (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:775-777)

```java
int columnIndex = getColumnIndexFromParquetColumn(column);
if (columnIndex == -1
    || preInitializedReaders == null
    || preInitializedReaders[columnIndex] == null) {
```

Issue: If columnIndex is valid but not smaller than preInitializedReaders.length, the access throws ArrayIndexOutOfBoundsException. A sketch of a centralized bounds check follows below.
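One way to address both bounds issues is a small helper that mirrors the length check already used in the schema-filtering loop; the method name is hypothetical, not from the PR:

```java
// Hypothetical helper: centralizes the null and bounds checks that should run
// before any preInitializedReaders[index] access.
private boolean hasPreInitializedReader(int index) {
  return preInitializedReaders != null
      && index >= 0
      && index < preInitializedReaders.length
      && preInitializedReaders[index] != null;
}
```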
Medium Issues

4. Schema Filtering Side Effect (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:523-541)

```java
if (preInitializedReaders != null) {
  StructType filteredSchema = new StructType();
  // ... filtering logic ...
  sparkSchema = filteredSchema;
}
```

Issue: Mutating the sparkSchema field as a side effect of this filtering makes the reader state harder to reason about, especially if initialization runs more than once.

5. Resource Cleanup in FileInfo.pathUri() (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:147-150)

```java
public URI pathUri() throws Exception {
  URI uri = new URI(filePath);
  return uri;
}
```

Issue: The method signature declares throws Exception even though new URI(String) can only throw URISyntaxException, which forces callers to catch an overly generic exception.
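A minimal sketch of the narrower signature (assuming callers can be updated to handle URISyntaxException):

```java
// Sketch: declare the only checked exception new URI(String) actually throws,
// so callers can handle malformed paths specifically.
public URI pathUri() throws URISyntaxException {
  return new URI(filePath);
}
```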
Performance Considerations

Positive

✅ Metadata serialization: reusing serialized Parquet metadata avoids redundant I/O operations.

Concerns

1. Repeated String Operations (common/src/main/java/org/apache/comet/parquet/AbstractColumnReader.java:92-94)

```java
String getPath() {
  return String.join(".", this.descriptor.getPath());
}
```

Issue: This creates a new string every time it's called. If called frequently, consider caching.

2. Array Conversion in Hot Path (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:794)

```java
String[] targetPath = asJava(column.path()).toArray(new String[0]);
```

Issue: Converting Scala collections to Java arrays on every call could be expensive.

3. Nested Loop for Field Matching (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:997-1005)

```java
for (int j = 0; j < sparkFields.length; j++) {
  if (sparkFields[j].name().equals(field.getName())) {
    // ...
  }
}
```

Issue: O(n²) complexity when matching multiple fields; see the lookup-map sketch below.
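A sketch of a one-time name-to-index map that also honors Spark's case-sensitivity flag; the map variable is a hypothetical addition, while sparkFields, field, and isCaseSensitive come from the surrounding review context:

```java
// Build the lookup once per schema instead of scanning sparkFields for every
// Parquet field. Case handling follows the isCaseSensitive flag.
Map<String, Integer> sparkFieldIndex = new HashMap<>();
for (int j = 0; j < sparkFields.length; j++) {
  String key =
      isCaseSensitive ? sparkFields[j].name() : sparkFields[j].name().toLowerCase(Locale.ROOT);
  sparkFieldIndex.put(key, j);
}

String lookupKey =
    isCaseSensitive ? field.getName() : field.getName().toLowerCase(Locale.ROOT);
Integer sparkSchemaIndex = sparkFieldIndex.get(lookupKey);
if (sparkSchemaIndex == null) {
  throw new IOException(
      "Could not find matching Spark field for Parquet field: " + field.getName());
}
DataType dataType = sparkFields[sparkSchemaIndex].dataType();
```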
Security Concerns

Medium Priority

1. Path Injection Risk (common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:120-124)

```java
URI uri = new Path(filePath).toUri();
if (uri.getScheme() == null) {
  uri = new Path("file://" + filePath).toUri();
}
this.filePath = uri.toString();
```

Issue: No validation of filePath before it is normalized into a URI and stored.

2. Serialized Data Deserialization (common/src/main/java/org/apache/comet/parquet/ParquetMetadataSerializer.java:69-72)

```java
public ParquetMetadata deserialize(byte[] bytes) throws IOException {
  ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes);
  FileMetaData fileMetaData = Util.readFileMetaData(inputStream);
  return converter.fromParquetMetadata(fileMetaData);
}
```

Issue: Deserializing untrusted byte arrays could be a security risk if the bytes come from untrusted sources.
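A sketch of basic defensive guards before handing the bytes to the Thrift reader; the size constant is an illustrative assumption, not a value from the PR:

```java
// Sketch: reject empty or implausibly large payloads before Thrift parsing.
// MAX_SERIALIZED_FOOTER_BYTES is a hypothetical, configurable cap.
public ParquetMetadata deserialize(byte[] bytes) throws IOException {
  if (bytes == null || bytes.length == 0) {
    throw new IOException("Serialized Parquet metadata is empty");
  }
  if (bytes.length > MAX_SERIALIZED_FOOTER_BYTES) {
    throw new IOException("Serialized Parquet metadata exceeds the configured limit");
  }
  ByteArrayInputStream inputStream = new ByteArrayInputStream(bytes);
  FileMetaData fileMetaData = Util.readFileMetaData(inputStream);
  return converter.fromParquetMetadata(fileMetaData);
}
```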
3. Exception Information Disclosure (multiple locations)

```java
throw new IOException("Required column is missing in data file. Col: " + Arrays.toString(path));
```

Issue: Stack traces and error messages might expose internal file system structure.

Test Coverage

Critical Gap

❌ No test files found for the new classes.
Recommendations

1. Add Unit Tests for ParquetMetadataSerializer (a round-trip sketch follows this list)

Suggested test cases:
- testSerializeDeserializeRoundTrip()
- testDeserializeEmptyBytes()
- testDeserializeCorruptedBytes()
- testSerializeLargeMetadata()
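A hypothetical JUnit-style sketch of the round-trip case; it assumes ParquetMetadataSerializer exposes serialize(ParquetMetadata) and deserialize(byte[]) as referenced elsewhere in this review, and the footer fixture helper is invented for illustration:

```java
// Round-trip sketch: serialize a real footer, deserialize it, and compare a
// few stable properties rather than relying on object equality.
@Test
public void testSerializeDeserializeRoundTrip() throws Exception {
  ParquetMetadata original = readFooterOfSmallTestFile(); // hypothetical fixture helper
  ParquetMetadataSerializer serializer = new ParquetMetadataSerializer();

  byte[] bytes = serializer.serialize(original);
  ParquetMetadata restored = serializer.deserialize(bytes);

  assertEquals(original.getBlocks().size(), restored.getBlocks().size());
  assertEquals(
      original.getFileMetaData().getSchema(), restored.getFileMetaData().getSchema());
}
```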
2. Add Tests for IcebergCometNativeBatchReader

Suggested test cases:
- testInitWithValidMetadata()
- testInitWithInvalidMetadata()
- testPreInitializedReaders()
- testMissingRequiredColumnsWithPreInitReaders()

3. Add Tests for FileInfo

Suggested test cases:
- testFileInfoFromPartitionedFile()
- testFileInfoWithMissingScheme()
- testFileInfoWithRelativePath()
- testPathUriGeneration()

4. Integration Tests

Dependency Changes

Positive
✅ Added the parquet-format-structures dependency.

Verification Needed

Summary of Recommendations

Must Fix (Before Merge)

Should Fix

Nice to Have

Overall, this is a solid refactoring that enables important Iceberg functionality. The main concerns are around test coverage, array bounds checking, and input validation. With these addressed, this will be a strong addition to the codebase.
Review completed. No suggestions at this time.
Comment augment review to trigger a new review at any time.
Actionable comments posted: 2
🧹 Nitpick comments (1)
common/src/main/java/org/apache/comet/parquet/ParquetMetadataSerializer.java (1)
55-60: Replace magic number with ParquetFileWriter constant.

Using a hard-coded 1 risks drifting if Parquet bumps its writer version. Please delegate to the official constant to stay aligned automatically.

Apply this diff:

```diff
@@
-import org.apache.parquet.format.FileMetaData;
-import org.apache.parquet.format.Util;
-import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.format.FileMetaData;
+import org.apache.parquet.format.Util;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileWriter;
@@
-    FileMetaData fileMetaData = converter.toParquetMetadata(1, metadata);
+    FileMetaData fileMetaData =
+        converter.toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, metadata);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- common/pom.xml (1 hunks)
- common/src/main/java/org/apache/comet/parquet/AbstractColumnReader.java (1 hunks)
- common/src/main/java/org/apache/comet/parquet/IcebergCometNativeBatchReader.java (1 hunks)
- common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java (11 hunks)
- common/src/main/java/org/apache/comet/parquet/ParquetMetadataSerializer.java (1 hunks)
- native/core/Cargo.toml (1 hunks)
- pom.xml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java (3)
common/src/main/scala/org/apache/comet/objectstore/NativeConfig.scala (2)
- NativeConfig (30-88)
- extractObjectStoreOptions (58-87)

common/src/main/java/org/apache/comet/parquet/ReadOptions.java (1)
- ReadOptions (36-210)

common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala (1)
- toArrowSchema (191-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Cursor Bugbot
- GitHub Check: codex
- GitHub Check: claude-review
- GitHub Check: spark-sql-catalyst/ubuntu-24.04/spark-4.0.1/java-17
- GitHub Check: spark-sql-sql/hive-1/ubuntu-24.04/spark-4.0.1/java-17
- GitHub Check: spark-sql-sql/core-2/ubuntu-24.04/spark-4.0.1/java-17
- GitHub Check: spark-sql-sql/core-1/ubuntu-24.04/spark-4.0.1/java-17
- GitHub Check: spark-sql-sql/hive-3/ubuntu-24.04/spark-4.0.1/java-17
- GitHub Check: spark-sql-sql/hive-2/ubuntu-24.04/spark-4.0.1/java-17
- GitHub Check: spark-sql-sql/core-3/ubuntu-24.04/spark-4.0.1/java-17
```java
        // Keep the column if:
        // 1. It doesn't have a preinitialized reader, OR
        // 2. It has a preinitialized reader but exists in fileSchema
        boolean hasPreInitializedReader =
            i < preInitializedReaders.length && preInitializedReaders[i] != null;
        int finalI = i;
        boolean existsInFileSchema =
            fileFields.stream().anyMatch(f -> f.getName().equals(sparkFields[finalI].name()));

        if (!hasPreInitializedReader || existsInFileSchema) {
          filteredSchema = filteredSchema.add(sparkFields[i]);
        }
      }
      sparkSchema = filteredSchema;
```
Respect case-insensitive schema resolution when filtering Spark fields
Line 535 currently does a hard equals between Spark and Parquet column names. In sessions where isCaseSensitive is false (Spark’s default), legitimate columns whose casing differs (common with Hive/Iceberg metadata) are treated as “missing”, so we drop them from sparkSchema and later fail to produce data for them. Please compare using the same resolver as the rest of the reader—e.g., branch on isCaseSensitive and fall back to equalsIgnoreCase—so case-insensitive schemas keep working.
🤖 Prompt for AI Agents
In common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java around
lines 528 to 541, the name comparison when deciding whether a Spark field exists
in the Parquet file uses a case-sensitive equals which breaks behaviour when
Spark is configured case-insensitive; change the existence check to respect the
reader’s case-sensitivity setting (i.e., if isCaseSensitive use equals,
otherwise use a case-insensitive comparison like equalsIgnoreCase or the same
resolver used elsewhere in this class) so fields that differ only by case are
not dropped from filteredSchema; update the anyMatch predicate accordingly to
use that conditional comparison.
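A sketch of the suggested predicate, using the names from the diff above and an isCaseSensitive flag as described in this comment:

```java
// Sketch: pick the comparison based on the session's case-sensitivity setting
// instead of a hard equals, so columns differing only by case are kept.
int finalI = i;
boolean existsInFileSchema =
    fileFields.stream()
        .anyMatch(
            f ->
                isCaseSensitive
                    ? f.getName().equals(sparkFields[finalI].name())
                    : f.getName().equalsIgnoreCase(sparkFields[finalI].name()));
```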
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct that Apache Spark has a special configuration setting to decide whether to compare the field names case sensitive or not. The finding prevents issues in Parquet <-> Spark field mappings.
@martin-augment, thank you for confirming the finding! I'm glad the case-sensitivity analysis was helpful in identifying a potential issue with Parquet ↔ Spark field mappings. Let me know if you need any assistance implementing the fix or have questions about the suggested approach.
```java
      // Find the corresponding spark field by matching field names
      DataType dataType = null;
      int sparkSchemaIndex = -1;
      for (int j = 0; j < sparkFields.length; j++) {
        if (sparkFields[j].name().equals(field.getName())) {
          dataType = sparkFields[j].dataType();
          sparkSchemaIndex = j;
          break;
        }
      }

      if (dataType == null) {
        throw new IOException(
            "Could not find matching Spark field for Parquet field: " + field.getName());
      }
```
Column lookup must honor case-insensitive mode
When we locate the Spark field for a Parquet column, we again rely on a case-sensitive equals. With isCaseSensitive == false, the lookup falls through and throws IOException: Could not find matching Spark field... even though the column exists (only case differs). Please resolve using the configured resolver (e.g., isCaseSensitive ? equals : equalsIgnoreCase) before throwing, otherwise Iceberg/Hive workloads in the default case-insensitive mode break.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct that Apache Spark has a special configuration setting to decide whether to compare the field names case sensitive or not. The finding prevents issues in Parquet <-> Spark field mappings.
@martin-augment, thank you for confirming this finding! I'm glad the review helped identify this issue before it could cause problems with case-insensitive Parquet field mappings in production workloads. Your feedback helps me improve future reviews. 🙏
(\___/)
(='.'=) Happy to help catch that bug!
(")_(")
✏️ Learnings added
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 14
File: common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java:998-1012
Timestamp: 2025-11-10T07:10:16.234Z
Learning: In NativeBatchReader.java (Java, Apache Comet Parquet reader), when matching Spark field names to Parquet field names, always respect the `isCaseSensitive` configuration flag. Use case-sensitive comparison (`equals`) when `isCaseSensitive == true`, and case-insensitive comparison (`equalsIgnoreCase`) when `isCaseSensitive == false`. This prevents field lookup failures in Iceberg/Hive workloads that run in the default case-insensitive mode.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Findings
value:useful; category:bug; feedback: The Codex AI reviewer is correct that the PR changes the timezone from the configured one (spark.sql.session.timeZone) to a hard-coded UTC.
value:incorrect-but-reasonable; category:bug; feedback: The Claude AI reviewer is not correct. Spark's StructType#fields() returns an empty array, so there is no chance for NullPointerException (https://github.com/apache/spark/blob/79d0c427878f3fc9a95375679934ce6d59595014/sql/api/src/main/scala/org/apache/spark/sql/types/StructType.scala#L108)
value:useful; category:bug; feedback: The Claude AI reviewer is correct that there is no check whether preInitializedReaders has an item for index i before accessing it.
value:useful; category:bug; feedback: The Claude AI reviewer is correct that there is no check whether preInitializedReaders has an item for columnIndex before accessing it.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct that the method's throws clause is broader than it needs to be. Since the method could throw only one type of exception (URISyntaxException), it is better to be specific here instead of using the more generic java.lang.Exception. This way the callers of this method can handle it in a more specific way.
value:good-but-wont-fix; category:bug; feedback: The Claude AI reviewer is correct that these lines in the code might be expensive if used in a hot path, but usually there is no need to optimize something until it shows up in the CPU profiler as slow.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct that these loops could be optimized to execute faster. But this depends on whether the fields should always be compared in a case-sensitive way or not. The finding prevents slower execution when comparing the Parquet fields with the Spark ones.
2680: To review by AI
Note
Introduces an Iceberg-specific native batch reader using serialized Parquet metadata, refactors the core reader to support FileInfo and pre-initialized readers, and updates Parquet/deps (incl. parquet-format-structures) plus bumps Rust uuid.
- IcebergCometNativeBatchReader accepting Thrift-encoded ParquetMetadata bytes; two-step init via FileInfo; supports passing preInitializedReaders.
- NativeBatchReader: FileInfo abstraction (URI normalization) and an alternate constructor; fields widened to protected.
- preInitializedReaders support (schema filtering, required-column checks, and reader reuse).
- Timezone fixed to UTC; Spark field matching by name; uses CometInputFile.fromPath(...) without footer; object store options via pathUri; key retriever keyed by filePath.
- Added getColumnIndexFromParquetColumn and updated checkParquetType to allow preinitialized required columns.
- AbstractColumnReader adds getPath().
- ParquetMetadataSerializer to serialize/deserialize Parquet metadata bytes.
- Added org.apache.parquet:parquet-format-structures (and excluded it from Spark deps); wired into common/pom.xml and root pom.xml dependency management.
- Bumped uuid to 1.18.1 in native/core/Cargo.toml.

Written by Cursor Bugbot for commit d8cd7b7. This will update automatically on new commits.