
Conversation

@jenwitteng commented Aug 27, 2025

Why I'm doing:

Copied the logic from apache/arrow#43995 ([C++][Parquet] Fix schema conversion from two-level encoding nested list).

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels which the PR will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Fixes Parquet schema conversion to properly parse nested lists (including two-level encoding) and lists of maps, with accompanying unit tests.

  • Parquet schema conversion (be/src/formats/parquet/schema.cpp)
    • Refactor the special-case helper to has_list_element_name(name, parent) and adjust the logic for the array and <parent>_tuple naming patterns.
    • Enhance list_to_field to (see the sketch after this note):
      • Support two-level nested list encoding (the repeated child is itself repeated) when LIST-annotated.
      • Treat repeated groups with multiple children as struct elements.
      • Use struct for single-child groups named array or <parent>_tuple; otherwise parse the inner node directly.
      • Add stricter validation (e.g., LIST groups not repeated; must be LIST-annotated; at least one child).
    • Minor comment tweak in map handling.
  • Tests (be/test/formats/parquet/parquet_schema_test.cpp)
    • Add cases for two-level List<List<Integer>> and three-level List<Map<String,String>> conversions.
    • Update expected field structures accordingly; minor comment/style fixes.

Written by Cursor Bugbot for commit a68d79a.
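
To make the classification rules summarized in the note above concrete, here is a minimal, self-contained C++ sketch. The Node struct, ElementKind enum, and classify_repeated_child helper are hypothetical stand-ins introduced for illustration; the actual list_to_field in be/src/formats/parquet/schema.cpp works on tparquet::SchemaElement entries and builds ParquetField objects, so treat this only as a sketch of the decision order, not the real implementation.

    #include <cassert>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for a parsed schema node; the real code walks
    // tparquet::SchemaElement entries instead.
    struct Node {
        std::string name;
        bool is_group = false;
        bool is_repeated = false;
        bool is_list_annotated = false; // carries the LIST logical type
        std::vector<Node> children;
    };

    // Backward-compatibility special case: a single-child repeated group named
    // "array" or "<parent>_tuple" is itself the element (a struct).
    bool has_list_element_name(const std::string& name, const std::string& parent) {
        return name == "array" || name == parent + "_tuple";
    }

    enum class ElementKind {
        kNestedList, // two-level encoding: the repeated group is itself a list element
        kStruct,     // the repeated group is the element, typed as a struct
        kInnerNode,  // standard three-level encoding: parse the repeated group's child
        kPrimitive,  // the repeated node is a primitive element
    };

    // Sketch of the decision order for the repeated child of a LIST-annotated group.
    ElementKind classify_repeated_child(const Node& list_group, const Node& repeated) {
        if (!repeated.is_group) {
            return ElementKind::kPrimitive;
        }
        // Two-level nested list: the repeated child is itself LIST-annotated and its
        // own child is repeated, so the repeated node is the element, not a wrapper.
        if (repeated.is_list_annotated && repeated.children.size() == 1 &&
            repeated.children[0].is_repeated) {
            return ElementKind::kNestedList;
        }
        // A repeated group with multiple children is a struct element.
        if (repeated.children.size() > 1) {
            return ElementKind::kStruct;
        }
        // A single-child group named "array" or "<parent>_tuple" is also a struct element.
        if (has_list_element_name(repeated.name, list_group.name)) {
            return ElementKind::kStruct;
        }
        // Otherwise parse the inner node directly (standard three-level encoding).
        return ElementKind::kInnerNode;
    }

    int main() {
        // Two-level encoded List<List<Integer>>:
        //   optional group my_list (LIST) { repeated group array (LIST) { repeated int32 array; } }
        Node inner{"array", /*is_group=*/false, /*is_repeated=*/true, /*is_list_annotated=*/false, {}};
        Node repeated{"array", true, true, true, {inner}};
        Node my_list{"my_list", true, false, true, {repeated}};
        assert(classify_repeated_child(my_list, repeated) == ElementKind::kNestedList);
        return 0;
    }

The main() case mirrors the two-level List<List<Integer>> shape that the new unit test covers.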

@alvin-celerdata (Contributor)

@cursor review

@cursor (bot) left a comment

✅ Bugbot reviewed your changes and found no bugs!


trueeyu previously approved these changes on Sep 1, 2025
github-actions bot commented Sep 1, 2025

[BE Incremental Coverage Report]

pass : 29 / 31 (93.55%)

file detail

path                                covered_line   new_line   coverage   not_covered_line_detail
🔵 src/formats/parquet/schema.cpp   29             31         93.55%     [177, 186]

@alvin-celerdata (Contributor)

Is this PR fixing the same issue as #64160? If so, could we close this PR?

@jenwitteng (Author)

Is this PR fixing the same issue as #64160? If so, could we close this PR?

No, that PR only fixed the files() path; this PR fixes schema reads from Hive tables.

@xhumanoid (Contributor)

@alvin-celerdata

SR's be/src/formats/parquet/schema.cpp contains code initially developed in Arrow and adapted to SR's logic:

// The schema resolve logic is copied from https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc
Status SchemaDescriptor::from_thrift(const std::vector<tparquet::SchemaElement>& t_schemas, bool case_sensitive) {

which is called from FileMetaData::init:

Status FileMetaData::init(tparquet::FileMetaData& t_metadata, bool case_sensitive) {
    // construct schema from thrift
    RETURN_IF_ERROR(_schema.from_thrift(t_metadata.schema, case_sensitive));

But this code contained some bugs, and a fix landed a year ago in apache/arrow#43995.

We took that fix and adapted it to the SR codebase. This fix is more generic because it applies to the schema read directly from the Parquet file (const std::vector<tparquet::SchemaElement>& t_schemas).

The fix for #64160, however, has a different stack trace:

starrocks::ParquetReaderWrap::get_schema(std::vector<starrocks::SlotDescriptor, std::allocator<starrocks::SlotDescriptor> >*)
starrocks::ParquetScanner::get_schema(std::vector<starrocks::SlotDescriptor, std::allocator<starrocks::SlotDescriptor> >*)
starrocks::FileScanner::sample_schema(starrocks::RuntimeState*, starrocks::TBrokerScanRange const&, std::vector<starrocks::SlotDescriptor, std::allocator<starrocks::SlotDescriptor> >*)
starrocks::PInternalServiceImplBase<starrocks::PInternalService>::_get_file_schema(google::protobuf::RpcController*, starrocks::PGetFileSchemaRequest const*, starrocks::PGetFileSchemaResult*, google::protobuf::Closure*)

It comes from cases where you use files() and read samples from the files to derive schema information.

As I remember, the reader used by files() and the reader for Hive/Iceberg take slightly different paths when working with Parquet.

In our case, the error occurred when we tried to query an external table; we got errors without a crash:

SQL Error [1064] [42000]: ParquetField 'array' file's type struct is different from table's type ARRAY: file = s3a://bucket/table1/file.parquet: BE:197024

During the investigation we found the original issue in SR and, after that, the fix in the Arrow project.
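
For reference, one plausible shape of a file schema that produces this kind of error is the legacy two-level list encoding, shown here next to the standard three-level form of the same logical type List<List<Integer>>. The field names are illustrative (following the Parquet format spec's backward-compatibility examples), not taken from the failing file:

    // Legacy two-level encoding: the repeated group "array" is itself the list
    // element, so reading it as a struct wrapper yields struct<array: ...>
    // instead of ARRAY<ARRAY<INT>>.
    optional group my_list (LIST) {
      repeated group array (LIST) {
        repeated int32 array;
      }
    }

    // Standard three-level encoding of the same logical type, for comparison.
    optional group my_list (LIST) {
      repeated group list {
        optional group element (LIST) {
          repeated group list {
            optional int32 element;
          }
        }
      }
    }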

@alvin-celerdata (Contributor)

@mergify rebase

mergify bot commented Nov 6, 2025

rebase

✅ Branch has been successfully rebased

alvin-celerdata force-pushed the fix-schema-conversion-two-level-encoding branch from 0de25bf to 038a0e5 on November 6, 2025 at 21:51
@alvin-celerdata (Contributor)

@cursor review

@cursor (bot) left a comment

✅ Bugbot reviewed your changes and found no bugs!


@alvin-celerdata (Contributor)

@jenwitteng the compilation failed, could you fix it?

@jenwitteng (Author)

@jenwitteng the compilation failed, could you fix it?

Fixed.

@alvin-celerdata (Contributor)

@mergify rebase

mergify bot commented Nov 7, 2025

rebase

✅ Branch has been successfully rebased

alvin-celerdata force-pushed the fix-schema-conversion-two-level-encoding branch from 0330304 to 09c85d3 on November 7, 2025 at 18:00
github-actions bot commented Nov 7, 2025

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

github-actions bot commented Nov 7, 2025

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@alvin-celerdata (Contributor)

@jenwitteng UT failures, could you fix them?

@jenwitteng (Author)

@jenwitteng UT failures, could you fix them?

Fixed

@alvin-celerdata (Contributor)

@cursor review

@xhumanoid (Contributor)

@alvin-celerdata

@alvin-celerdata (Contributor)

@cursor review

mergify bot commented Dec 9, 2025

🧪 CI Insights

Here's what we observed from your CI run for a68d79a.

🟢 All jobs passed!

But CI Insights is watching 👀

The review comment below refers to this excerpt from has_list_element_name:

    static const Slice tuple_slice("_tuple", 6);
    Slice slice(name);
    // old check:
    return slice == array_slice || slice.ends_with(tuple_slice);
    // new check:
    return slice == array_slice || name == (parent + "_tuple");

Bug: Stricter tuple name matching breaks backward compatibility

The new has_list_element_name function changes the _tuple suffix matching behavior in a backward-incompatible way. The old function has_struct_list_name checked slice.ends_with("_tuple"), matching any name ending in _tuple. The new function requires an exact match of parent + "_tuple". This breaks existing test cases like my_list_7 where the parent is "my_list_7" but the child is named "my_list_tuple" (not "my_list_7_tuple"). Such schemas will now be incorrectly parsed as three-level list encoding instead of being recognized as the struct-wrapped list pattern.
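
A minimal, compilable illustration of the behavioral difference described above, using the names from Bugbot's example; the two helpers are simplified stand-ins for the Slice-based code quoted in the excerpt, not the actual StarRocks functions:

    #include <iostream>
    #include <string>

    // Old behavior (simplified): any child name ending in "_tuple" marks the
    // struct-wrapped list pattern.
    bool old_match(const std::string& name) {
        const std::string suffix = "_tuple";
        return name == "array" ||
               (name.size() >= suffix.size() &&
                name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0);
    }

    // New behavior (simplified): the child must be named exactly "<parent>_tuple".
    bool new_match(const std::string& name, const std::string& parent) {
        return name == "array" || name == parent + "_tuple";
    }

    int main() {
        // Bugbot's example: parent group "my_list_7" with a child named "my_list_tuple".
        std::cout << old_match("my_list_tuple") << "\n";              // 1: matched before
        std::cout << new_match("my_list_tuple", "my_list_7") << "\n"; // 0: not matched now
        return 0;
    }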

