
Conversation

@jenwitteng commented Aug 27, 2025

Why I'm doing:

Copied the logic from apache/arrow#43995 ([C++][Parquet] Fix schema conversion from two-level encoding nested list).

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels which the PR will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Fixes Parquet schema conversion to properly parse nested lists (including two-level encoding) and lists of maps, with accompanying unit tests.

  • Parquet schema conversion (be/src/formats/parquet/schema.cpp)
    • Refactor the special-case helper to has_list_element_name(name, parent) and adjust the logic for the array and <parent>_tuple naming patterns.
    • Enhance list_to_field to (see the sketch after this note):
      • Support two-level nested list encoding (the repeated child is itself repeated) when LIST-annotated.
      • Treat repeated groups with multiple children as struct elements.
      • Use struct for single-child groups named array or <parent>_tuple; otherwise parse the inner node directly.
      • Add stricter validation (e.g., LIST groups not repeated; must be LIST-annotated; at least one child).
    • Minor comment tweak in map handling.
  • Tests (be/test/formats/parquet/parquet_schema_test.cpp)
    • Add cases for two-level List<List<Integer>> and three-level List<Map<String,String>> conversions.
    • Update expected field structures accordingly; minor comment/style fixes.

Written by Cursor Bugbot for commit a68d79a.
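
To make the classification rules summarized in the note above concrete, here is a minimal, self-contained C++ sketch. The Node struct, ElementKind enum, and classify_repeated_child helper are hypothetical stand-ins introduced for illustration; the actual list_to_field in be/src/formats/parquet/schema.cpp works on tparquet::SchemaElement entries and builds ParquetField objects, so treat this only as a sketch of the decision order, not the real implementation.

    #include <cassert>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for a parsed schema node; the real code walks
    // tparquet::SchemaElement entries instead.
    struct Node {
        std::string name;
        bool is_group = false;
        bool is_repeated = false;
        bool is_list_annotated = false; // carries the LIST logical type
        std::vector<Node> children;
    };

    // Backward-compatibility special case: a single-child repeated group named
    // "array" or "<parent>_tuple" is itself the element (a struct).
    bool has_list_element_name(const std::string& name, const std::string& parent) {
        return name == "array" || name == parent + "_tuple";
    }

    enum class ElementKind {
        kNestedList, // two-level encoding: the repeated group is itself a list element
        kStruct,     // the repeated group is the element, typed as a struct
        kInnerNode,  // standard three-level encoding: parse the repeated group's child
        kPrimitive,  // the repeated node is a primitive element
    };

    // Sketch of the decision order for the repeated child of a LIST-annotated group.
    ElementKind classify_repeated_child(const Node& list_group, const Node& repeated) {
        if (!repeated.is_group) {
            return ElementKind::kPrimitive;
        }
        // Two-level nested list: the repeated child is itself LIST-annotated and its
        // own child is repeated, so the repeated node is the element, not a wrapper.
        if (repeated.is_list_annotated && repeated.children.size() == 1 &&
            repeated.children[0].is_repeated) {
            return ElementKind::kNestedList;
        }
        // A repeated group with multiple children is a struct element.
        if (repeated.children.size() > 1) {
            return ElementKind::kStruct;
        }
        // A single-child group named "array" or "<parent>_tuple" is also a struct element.
        if (has_list_element_name(repeated.name, list_group.name)) {
            return ElementKind::kStruct;
        }
        // Otherwise parse the inner node directly (standard three-level encoding).
        return ElementKind::kInnerNode;
    }

    int main() {
        // Two-level encoded List<List<Integer>>:
        //   optional group my_list (LIST) { repeated group array (LIST) { repeated int32 array; } }
        Node inner{"array", /*is_group=*/false, /*is_repeated=*/true, /*is_list_annotated=*/false, {}};
        Node repeated{"array", true, true, true, {inner}};
        Node my_list{"my_list", true, false, true, {repeated}};
        assert(classify_repeated_child(my_list, repeated) == ElementKind::kNestedList);
        return 0;
    }

The main() case mirrors the two-level List<List<Integer>> shape that the new unit test covers.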

@alvin-celerdata (Contributor)

@cursor review

@cursor (bot) left a comment

✅ Bugbot reviewed your changes and found no bugs!


trueeyu previously approved these changes on Sep 1, 2025
github-actions bot commented Sep 1, 2025

[BE Incremental Coverage Report]

pass : 29 / 31 (93.55%)

file detail

path                                covered_line   new_line   coverage   not_covered_line_detail
🔵 src/formats/parquet/schema.cpp   29             31         93.55%     [177, 186]

@alvin-celerdata (Contributor)

Is this PR fixing the same issue as #64160? If so, could we close this PR?

@jenwitteng (Author)

Is this PR fixing the same issue as #64160? If so, could we close this PR?

No, that PR only fixed the files() path; this PR fixes schema reads from Hive tables.

@xhumanoid (Contributor)

@alvin-celerdata

SR's be/src/formats/parquet/schema.cpp contains code initially developed in Arrow and adapted to SR's logic:

// The schema resolve logic is copied from https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc
Status SchemaDescriptor::from_thrift(const std::vector<tparquet::SchemaElement>& t_schemas, bool case_sensitive) {

which is called from FileMetaData::init:

Status FileMetaData::init(tparquet::FileMetaData& t_metadata, bool case_sensitive) {
    // construct schema from thrift
    RETURN_IF_ERROR(_schema.from_thrift(t_metadata.schema, case_sensitive));

But this code contained some bugs, and a fix landed a year ago in apache/arrow#43995.

We took that fix and adapted it to the SR codebase. This fix is more generic because it applies to the schema read directly from the Parquet file (const std::vector<tparquet::SchemaElement>& t_schemas).

The fix for #64160, however, has a different stack trace:

starrocks::ParquetReaderWrap::get_schema(std::vector<starrocks::SlotDescriptor, std::allocator<starrocks::SlotDescriptor> >*)
starrocks::ParquetScanner::get_schema(std::vector<starrocks::SlotDescriptor, std::allocator<starrocks::SlotDescriptor> >*)
starrocks::FileScanner::sample_schema(starrocks::RuntimeState*, starrocks::TBrokerScanRange const&, std::vector<starrocks::SlotDescriptor, std::allocator<starrocks::SlotDescriptor> >*)
starrocks::PInternalServiceImplBase<starrocks::PInternalService>::_get_file_schema(google::protobuf::RpcController*, starrocks::PGetFileSchemaRequest const*, starrocks::PGetFileSchemaResult*, google::protobuf::Closure*)

It comes from cases where you use files() and read samples from the files to derive schema information.

As I remember, the reader used by files() and the reader for Hive/Iceberg take slightly different paths when working with Parquet.

In our case, the error occurred when we tried to query an external table; we got errors without a crash:

SQL Error [1064] [42000]: ParquetField 'array' file's type struct is different from table's type ARRAY: file = s3a://bucket/table1/file.parquet: BE:197024

During the investigation we found the original issue in SR and, after that, the fix in the Arrow project.
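
For reference, one plausible shape of a file schema that produces this kind of error is the legacy two-level list encoding, shown here next to the standard three-level form of the same logical type List<List<Integer>>. The field names are illustrative (following the Parquet format spec's backward-compatibility examples), not taken from the failing file:

    // Legacy two-level encoding: the repeated group "array" is itself the list
    // element, so reading it as a struct wrapper yields struct<array: ...>
    // instead of ARRAY<ARRAY<INT>>.
    optional group my_list (LIST) {
      repeated group array (LIST) {
        repeated int32 array;
      }
    }

    // Standard three-level encoding of the same logical type, for comparison.
    optional group my_list (LIST) {
      repeated group list {
        optional group element (LIST) {
          repeated group list {
            optional int32 element;
          }
        }
      }
    }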

@alvin-celerdata (Contributor)

@mergify rebase

mergify bot commented Nov 6, 2025

rebase

✅ Branch has been successfully rebased

alvin-celerdata force-pushed the fix-schema-conversion-two-level-encoding branch from 0de25bf to 038a0e5 on November 6, 2025 at 21:51
@alvin-celerdata (Contributor)

@cursor review

@cursor (bot) left a comment

✅ Bugbot reviewed your changes and found no bugs!


@alvin-celerdata (Contributor)

@jenwitteng the compilation failed, could you fix it?

@jenwitteng (Author)

@jenwitteng the compilation failed, could you fix it?

Fixed.

@alvin-celerdata (Contributor)

@mergify rebase

mergify bot commented Nov 7, 2025

rebase

✅ Branch has been successfully rebased

alvin-celerdata force-pushed the fix-schema-conversion-two-level-encoding branch from 0330304 to 09c85d3 on November 7, 2025 at 18:00
github-actions bot commented Nov 7, 2025

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

github-actions bot commented Nov 7, 2025

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@alvin-celerdata (Contributor)

@jenwitteng UT failures, could you fix them?

@jenwitteng (Author)

@jenwitteng UT failures, could you fix them?

Fixed

@alvin-celerdata (Contributor)

@cursor review

@xhumanoid (Contributor)

@alvin-celerdata

@alvin-celerdata (Contributor)

@cursor review

mergify bot commented Dec 9, 2025

🧪 CI Insights

Here's what we observed from your CI run for a68d79a.

🟢 All jobs passed!

But CI Insights is watching 👀

The review comment below refers to this excerpt from has_list_element_name:

    static const Slice tuple_slice("_tuple", 6);
    Slice slice(name);
    // old check:
    return slice == array_slice || slice.ends_with(tuple_slice);
    // new check:
    return slice == array_slice || name == (parent + "_tuple");

Bug: Stricter tuple name matching breaks backward compatibility

The new has_list_element_name function changes the _tuple suffix matching behavior in a backward-incompatible way. The old function has_struct_list_name checked slice.ends_with("_tuple"), matching any name ending in _tuple. The new function requires an exact match of parent + "_tuple". This breaks existing test cases like my_list_7 where the parent is "my_list_7" but the child is named "my_list_tuple" (not "my_list_7_tuple"). Such schemas will now be incorrectly parsed as three-level list encoding instead of being recognized as the struct-wrapped list pattern.
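
A minimal, compilable illustration of the behavioral difference described above, using the names from Bugbot's example; the two helpers are simplified stand-ins for the Slice-based code quoted in the excerpt, not the actual StarRocks functions:

    #include <iostream>
    #include <string>

    // Old behavior (simplified): any child name ending in "_tuple" marks the
    // struct-wrapped list pattern.
    bool old_match(const std::string& name) {
        const std::string suffix = "_tuple";
        return name == "array" ||
               (name.size() >= suffix.size() &&
                name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0);
    }

    // New behavior (simplified): the child must be named exactly "<parent>_tuple".
    bool new_match(const std::string& name, const std::string& parent) {
        return name == "array" || name == parent + "_tuple";
    }

    int main() {
        // Bugbot's example: parent group "my_list_7" with a child named "my_list_tuple".
        std::cout << old_match("my_list_tuple") << "\n";              // 1: matched before
        std::cout << new_match("my_list_tuple", "my_list_7") << "\n"; // 0: not matched now
        return 0;
    }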

