[BugFix] Fix schema conversion from two-level encoding nested list #62362
base: main
Conversation
@cursor review
✅ Bugbot reviewed your changes and found no bugs!
[BE Incremental Coverage Report] ✅ pass: 29 / 31 (93.55%)
Is this PR fixing the same issue as #64160? If so, could we close this PR?
No, that PR only fixed the files() path; this PR fixes the path used for Hive tables.
SR's be/src/formats/parquet/schema.cpp contains code that was originally developed in Arrow and adapted to SR's logic:

    // The schema resolve logic is copied from https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/schema.cc
    Status SchemaDescriptor::from_thrift(const std::vector<tparquet::SchemaElement>& t_schemas, bool case_sensitive) {

It is called from FileMetaData:

    Status FileMetaData::init(tparquet::FileMetaData& t_metadata, bool case_sensitive) {
        // construct schema from thrift
        RETURN_IF_ERROR(_schema.from_thrift(t_metadata.schema, case_sensitive));

That code contained some bugs, and a fix landed about a year ago in apache/arrow#43995. We took that fix and adapted it to the SR codebase. The fix for #64160, however, has a different stack trace: it comes from the case where you use files() and read samples from the files to derive schema information. As I remember, the reader used by files() and the reader used for Hive / Iceberg take slightly different paths when working with Parquet. In our case the error appeared when querying an external table, and we got errors without a crash. During the investigation we found the original issue in SR and, after that, the fix in the Arrow project.
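For context, a minimal illustration of the two list encodings involved (generic field names follow the Parquet format spec's examples, not this PR): the legacy two-level encoding places the repeated element node directly under the LIST group, while the standard three-level encoding inserts an extra repeated group between the list and its element.

    // legacy two-level encoding: the repeated node is itself the element
    optional group my_list (LIST) {
        repeated int32 element;
    }

    // standard three-level encoding: a repeated middle group wraps the element
    optional group my_list (LIST) {
        repeated group list {
            optional int32 element;
        }
    }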
@mergify rebase |
✅ Branch has been successfully rebased
Branch force-pushed from 0de25bf to 038a0e5
@cursor review |
✅ Bugbot reviewed your changes and found no bugs!
@jenwitteng the compilation failed, could you fix it?
Fixed. |
@mergify rebase |
✅ Branch has been successfully rebased
Branch force-pushed from 0330304 to 09c85d3
[Java-Extensions Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
@jenwitteng there are UT failures, could you fix them?
Fixed |
@cursor review |
@cursor review |
🧪 CI Insights: Here's what we observed from your CI run for a68d79a. 🟢 All jobs passed! But CI Insights is watching 👀
    // old (has_struct_list_name): suffix match
    static const Slice tuple_slice("_tuple", 6);
    Slice slice(name);
    return slice == array_slice || slice.ends_with(tuple_slice);
    // new (has_list_element_name): exact "<parent>_tuple" match
    return slice == array_slice || name == (parent + "_tuple");
Bug: Stricter tuple name matching breaks backward compatibility
The new has_list_element_name function changes the _tuple suffix matching behavior in a backward-incompatible way. The old function has_struct_list_name checked slice.ends_with("_tuple"), matching any name ending in _tuple. The new function requires an exact match of parent + "_tuple". This breaks existing test cases like my_list_7 where the parent is "my_list_7" but the child is named "my_list_tuple" (not "my_list_7_tuple"). Such schemas will now be incorrectly parsed as three-level list encoding instead of being recognized as the struct-wrapped list pattern.
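To make the behavioral difference concrete, here is a minimal standalone sketch of the two matching rules described above, assuming hypothetical names old_rule/new_rule and std::string in place of Slice (this is not the SR implementation):

    #include <iostream>
    #include <string>

    // Old rule (per has_struct_list_name): the child counts as the struct-wrapped
    // list element if it is named "array" or merely ends with "_tuple".
    static bool old_rule(const std::string& name) {
        static const std::string suffix = "_tuple";
        return name == "array" ||
               (name.size() >= suffix.size() &&
                name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0);
    }

    // New rule (per has_list_element_name): require an exact "<parent>_tuple" match.
    static bool new_rule(const std::string& name, const std::string& parent) {
        return name == "array" || name == parent + "_tuple";
    }

    int main() {
        // The my_list_7 case from the comment: parent "my_list_7", child "my_list_tuple".
        std::cout << old_rule("my_list_tuple") << "\n";              // 1: treated as struct-wrapped list
        std::cout << new_rule("my_list_tuple", "my_list_7") << "\n"; // 0: falls through to three-level parsing
        return 0;
    }

Under the new rule such a schema no longer matches the struct-wrapped pattern, which is the compatibility concern raised above.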
Why I'm doing:
Copied the logic from apache/arrow#43995 ([C++][Parquet] Fix schema conversion from two-level encoding nested list).
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check:
Note
Fixes Parquet schema conversion to properly parse nested lists (including two-level encoding) and lists of maps, with accompanying unit tests.
(be/src/formats/parquet/schema.cpp) Add has_list_element_name(name, parent) and adjust the logic for array and $parent_tuple naming; change list_to_field so the child is treated as the wrapped list element when named array or $PARENT_tuple, and otherwise the inner node is parsed directly.
(be/test/formats/parquet/parquet_schema_test.cpp) Add tests covering List<List<Integer>> and three-level List<Map<String,String>> conversions.
Written by Cursor Bugbot for commit a68d79a.
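For reference, plausible shapes of the two schemas those test cases exercise (field names here are illustrative and not taken from the test file): a List<List<Integer>> under the legacy two-level encoding, given the PR title, and a standard three-level List<Map<String,String>>.

    // two-level encoded List<List<Integer>>: each repeated node is itself the element
    optional group outer (LIST) {
        repeated group inner (LIST) {
            repeated int32 element;
        }
    }

    // three-level List<Map<String,String>>
    optional group maps (LIST) {
        repeated group list {
            optional group element (MAP) {
                repeated group key_value {
                    required binary key (STRING);
                    optional binary value (STRING);
                }
            }
        }
    }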