[Feature] Stream load support JSON to STRUCT/MAP/ARRAY #45406

Merged
merged 13 commits into from
May 29, 2024

Conversation

rickif
Contributor

@rickif rickif commented May 10, 2024

Why I'm doing:

Currently, StarRocks does not support loading JSON into complex-type columns.

What I'm doing:

This PR adds support for loading JSON-format data into STRUCT/MAP/ARRAY columns.

Here are some examples.

create table tbl_complex (
    col_int int,
    col_struct struct<key1 int>,
    col_array array<int>,
    col_map map<string,int>,
    col_struct_array struct<key3 array<int>>,
    col_struct_map struct<key4 map<string,int>>
);

curl --location-trusted -u root: 'http://127.0.0.1:18040/api/db0/tbl_complex/_stream_load' \
    -X PUT \
    -H 'format: json' \
    -d '{"col_int":1,"col_struct":{"key1":1}, "col_map":{"key2":2}, "col_array":[1,2,3], "col_struct_array":{"key3":[2,3,4,5]}, "col_struct_map":{"key4": {"key5":4}}}'

This feature also works for routine load.

Fixes #43101

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.3
    • 3.2
    • 3.1
    • 3.0
    • 2.5

@rickif rickif requested review from a team as code owners May 10, 2024 04:03
@mergify mergify bot assigned rickif May 10, 2024
@rickif rickif force-pushed the feat/json-to-struct branch from c50c31d to 6fe62fc Compare May 11, 2024 12:04
@rickif rickif requested a review from a team as a code owner May 11, 2024 12:04
@rickif
Contributor Author

rickif commented May 12, 2024

@Mergifyio rebase

Contributor

mergify bot commented May 12, 2024

rebase

✅ Branch has been successfully rebased

@rickif rickif force-pushed the feat/json-to-struct branch 6 times, most recently from 08a04c8 to 5153047 Compare May 13, 2024 06:34
be/src/formats/json/nullable_column.cpp
be/src/formats/json/nullable_column.cpp (outdated)
be/src/formats/json/struct_column.cpp
be/src/formats/json/map_column.cpp (outdated)
@@ -194,6 +194,12 @@ Status JsonScanner::_construct_json_types() {
break;
}

case TYPE_STRUCT:
Contributor

Does it support a nested array with struct or map as its element data type?
It seems the TYPE_ARRAY branch in the code above treats all other types as varchar.

Contributor Author

Updated. The nested types are supported now.
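
For readers following this thread, here is a minimal sketch of why recursing over child types matters for nested columns. `TypeDesc`, `Kind`, and `describe` are hypothetical stand-ins invented for illustration; they are not the StarRocks `TypeDescriptor` or the code in this PR. The point is only that each complex type delegates to its children, so `array<struct<...>>` keeps its structure instead of collapsing to a string type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified type descriptor (NOT the StarRocks TypeDescriptor).
enum class Kind { Int, Varchar, Array, Map, Struct };

struct TypeDesc {
    Kind kind;
    std::vector<std::string> field_names; // used by Struct only
    std::vector<TypeDesc> children;       // Array: 1 child; Map: key and value; Struct: one per field
};

// Render a type recursively so nested complex types keep their structure.
std::string describe(const TypeDesc& t) {
    switch (t.kind) {
    case Kind::Int:
        return "int";
    case Kind::Varchar:
        return "varchar";
    case Kind::Array:
        assert(t.children.size() == 1);
        return "array<" + describe(t.children[0]) + ">";
    case Kind::Map:
        assert(t.children.size() == 2);
        return "map<" + describe(t.children[0]) + "," + describe(t.children[1]) + ">";
    case Kind::Struct: {
        assert(t.children.size() == t.field_names.size());
        std::string out = "struct<";
        for (std::size_t i = 0; i < t.children.size(); ++i) {
            if (i > 0) out += ",";
            out += t.field_names[i] + " " + describe(t.children[i]);
        }
        return out + ">";
    }
    }
    return "unknown";
}
```

A loader built the same way would dispatch on the element type of an array rather than assuming a scalar or string element.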


auto field_column = struct_column->field_column(field_name);
simdjson::ondemand::value field_value = obj.find_field_unordered(field_name);
RETURN_IF_ERROR(add_nullable_column(field_column.get(), field_type_desc, name, &field_value, true));
Contributor

Should this be name or field_name?
What is name used for?

Contributor Author

name is the column name. It is used for error reporting.

{
// This is a tricky way to transform a std::string to simdjson:ondemand:value
std::string_view field_name_str = field.unescaped_key();
auto dummy_json = simdjson::padded_string(R"({"dummy_key": ")" + std::string(field_name_str) + R"("})");
Contributor

What is the tricky way? Does the field key need to be a JSON value rather than a simple string, so you parse a dummy_json to get the key? Why not implement an add_nullable_column overload that takes a string value?

Contributor Author

Yes. Since the last argument of add_nullable_column is a simdjson::ondemand::value, we have to convert the simple string to a simdjson::ondemand::value.
Many functions take a simdjson::ondemand::value as their input argument, so implementing add_nullable_column overloads for string/float/double/int values would involve massive changes.
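
To make the wrapper-document trick concrete, here is a standard-library-only sketch of building the one-field JSON text that would then be handed to the parser. `escape_json` and `wrap_key_as_json` are hypothetical names introduced for this illustration. The escaping step is an addition of this sketch, on the assumption that a map key could contain quotes or backslashes; the quoted diff concatenates the unescaped key directly.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Escape a raw string so it can be embedded as a JSON string literal.
std::string escape_json(std::string_view raw) {
    std::string out;
    out.reserve(raw.size());
    for (unsigned char c : raw) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:
            if (c < 0x20) {
                // Remaining control characters use the \uXXXX form.
                char buf[8];
                std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                out += buf;
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}

// Build the one-field wrapper document whose only value is the map key.
std::string wrap_key_as_json(std::string_view key) {
    return R"({"dummy_key": ")" + escape_json(key) + R"("})";
}
```

The resulting text can be parsed like any other document, and the value under `dummy_key` passed wherever a parsed JSON value is expected.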

auto map_column = down_cast<MapColumn*>(column);

try {
simdjson::ondemand::object obj = value->get_object();
Contributor

If value is not an object, what will the error message be?

Contributor Author

An exception with the message "The JSON element does not have the requested type." will be thrown and caught later.

Contributor

Can we give more detailed information, such as the expected type (xxx) and the actual type given (yyy)? That would be much clearer.
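
One possible shape for such a message, as a sketch only: `JsonKind`, `kind_name`, and `type_mismatch_message` are hypothetical names, and the wording of the message is an assumption rather than what the PR emits. How the actual kind would be obtained from the parser is left out here.

```cpp
#include <string>

// Hypothetical mirror of the JSON value kinds a parser can report.
enum class JsonKind { Array, Object, Number, String, Boolean, Null };

const char* kind_name(JsonKind kind) {
    switch (kind) {
    case JsonKind::Array:   return "array";
    case JsonKind::Object:  return "object";
    case JsonKind::Number:  return "number";
    case JsonKind::String:  return "string";
    case JsonKind::Boolean: return "boolean";
    case JsonKind::Null:    return "null";
    }
    return "unknown";
}

// Name the column, the expected JSON kind, and the kind actually found.
std::string type_mismatch_message(const std::string& column_name, JsonKind expected, JsonKind actual) {
    return "Failed to parse column '" + column_name + "': expected JSON " + kind_name(expected) +
           ", but got JSON " + kind_name(actual);
}
```

For a MAP column fed a JSON array, this would report the column name together with "expected JSON object, but got JSON array".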


namespace starrocks {

Status add_map_column(Column* column, const TypeDescriptor& type_desc, const std::string& name,
Contributor

How about extracting array processing code to add_array_column?

@@ -111,4 +112,34 @@ TEST_F(AddNullableColumnTest, add_null_numeric_array) {
column->check_or_die();
}

TEST_F(AddNullableColumnTest, test_add_struct) {
Contributor

What about nested STRUCT/MAP/ARRAY?

Contributor Author

Yes, they are supported. I've added more information in the PR description.

const auto& field_name = type_desc.field_names[i];
const auto& field_type_desc = type_desc.children[i];

auto field_column = struct_column->field_column(field_name);
Contributor

Is it case sensitive?

Contributor Author

Yes. If the struct field name differs in case from the JSON field name, the struct field value will be NULL.
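
The behavior described here can be illustrated with a small standard-library sketch. `JsonObject` and `fill_struct` are hypothetical stand-ins (a parsed object is modeled as a plain map of int values, and NULL as an empty `std::optional`); this is not the PR's implementation, only the exact-match lookup rule it describes.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed JSON object: field name -> int value.
using JsonObject = std::map<std::string, int>;

// For each struct field, look up the JSON key by exact (case-sensitive) match.
// A missing key yields std::nullopt, standing in for SQL NULL.
std::vector<std::optional<int>> fill_struct(const std::vector<std::string>& field_names,
                                            const JsonObject& obj) {
    std::vector<std::optional<int>> row;
    row.reserve(field_names.size());
    for (const auto& name : field_names) {
        auto it = obj.find(name);
        if (it == obj.end()) {
            row.push_back(std::nullopt);
        } else {
            row.push_back(it->second);
        }
    }
    return row;
}
```

With a struct field `key1`, the JSON object `{"KEY1": 1}` produces a NULL field under this rule, while `{"key1": 1}` produces 1.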

@rickif rickif force-pushed the feat/json-to-struct branch from e5169ac to eeb8056 Compare May 28, 2024 14:15
@rickif
Contributor Author

rickif commented May 28, 2024

@Mergifyio rebase

rickif and others added 11 commits May 28, 2024 23:59
Signed-off-by: ricky <rickif@qq.com>
Co-authored-by: wyb <wybb86@gmail.com>
@meegoo meegoo enabled auto-merge (squash) May 29, 2024 02:44

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 118 / 134 (88.06%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 be/src/formats/json/nullable_column.cpp | 38 | 46 | 82.61% | [213, 214, 234, 235, 249, 250, 267, 268] |
| 🔵 be/src/formats/json/struct_column.cpp | 16 | 19 | 84.21% | [29, 30, 31] |
| 🔵 be/src/formats/json/map_column.cpp | 25 | 28 | 89.29% | [30, 31, 32] |
| 🔵 be/src/exec/json_scanner.cpp | 39 | 41 | 95.12% | [279, 550] |

@meegoo meegoo merged commit a4f53d3 into StarRocks:main May 29, 2024
61 checks passed
@wyb
Contributor

wyb commented May 30, 2024

https://github.com/Mergifyio backport branch-3.3

Contributor

mergify bot commented May 30, 2024

backport branch-3.3

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request May 30, 2024
Signed-off-by: ricky <rickif@qq.com>
Co-authored-by: wyb <wybb86@gmail.com>
(cherry picked from commit a4f53d3)
wanpengfei-git pushed a commit that referenced this pull request Jun 1, 2024
@wyb wyb mentioned this pull request Jun 13, 2024
24 tasks
@StarRocks StarRocks deleted a comment from mergify bot Jun 20, 2024
@wyb
Contributor

wyb commented Jun 20, 2024

https://github.com/Mergifyio backport branch-3.2

Contributor

mergify bot commented Jun 20, 2024

backport branch-3.2

✅ Backports have been created

Development

Successfully merging this pull request may close these issues.

Support converting JSON to STRUCT
6 participants