Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Add native format writer to access StarRocks data bypass BE Server. #52700

Merged

Conversation

plotor
Copy link
Contributor

@plotor plotor commented Nov 7, 2024

Why I'm doing:

You can learn about the background from PR #52682.

What I'm doing:

In order to support bypassing the BE server, and directly reading and writing the data stored in the object storage in share-data mode, we extracted a format-sdk module. This SDK:

  • Provides the C++ version of Reader and Writer implementation, which can read and write StarRocks data files.
  • Provides the Java version of Reader and Writer (invokes the C++ Reader and Writer through JNI), which can be used by computing engines.
  • Implements data exchange through the Arrow format, making it easier for more programming languages to access.

This PR mainly contains the implementation of Writer. You can build it by:

mvn clean package -DskipTests

This will complete the compilation of C++ and Java codes, and finally generate the corresponding so and jar files. Then, you can call the SDK to perform data writing operations as follows (taking Java as an example):

// Create and open the Writer
StarRocksWriter srWriter = new StarRocksWriter(...);
srWriter.open();

// Write data, which organized in Arrow format
VectorSchemaRoot vsr = VectorSchemaRoot.create(schema, writer.getRootAllocator());
srWriter.write(vsr);

// Close the Writer
srWriter.close();

Fixes: #52682

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1

@mergify mergify bot assigned plotor Nov 7, 2024
@plotor plotor force-pushed the zhenchao-bypass-format-writer-20241107 branch 4 times, most recently from 300bdd1 to 1c009e6 Compare November 7, 2024 11:19
@plotor plotor force-pushed the zhenchao-bypass-format-writer-20241107 branch 4 times, most recently from 127475c to 69e0a59 Compare November 27, 2024 08:50
@plotor plotor marked this pull request as ready for review November 27, 2024 09:44
@plotor plotor requested a review from a team as a code owner November 27, 2024 09:44
@plotor plotor force-pushed the zhenchao-bypass-format-writer-20241107 branch from 69e0a59 to d6f19f3 Compare November 27, 2024 09:46
@plotor plotor force-pushed the zhenchao-bypass-format-writer-20241107 branch 5 times, most recently from e662ef3 to 45d87f8 Compare December 5, 2024 08:48
@@ -968,9 +968,15 @@ endif()

# Add all external dependencies. They should come after the starrocks libs.

set(STARROCKS_STD_DEPENDENCIES
-static-libstdc++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add this? these will be add later

if (NOT ("${MAKE_TEST}" STREQUAL "ON" AND "${BUILD_FOR_SANITIZE}" STREQUAL "ON"))
    # In other words, turn to dynamic link when MAKE_TEST and BUILD_TYPE == *SAN
    # otherwise do static link gcc's lib
    set(STARROCKS_LINK_LIBS ${STARROCKS_LINK_LIBS} -static-libstdc++ -static-libgcc)
endif()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have removed it

@@ -0,0 +1 @@
# StarRocks Format Library
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As planned, I will write a comprehensive document here in the next PR, including the use of Writer and Reader.

# specific language governing permissions and limitations
# under the License.

FILE(GLOB CONVERTER_SRC_FILES "*.cpp" "*.h")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall GLOB will affect incremental build, better list files explicitly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good advice, learned a lot.

Comment on lines 29 to 30
constexpr int32_t ARROW_CONVERTER_ID(arrow::Type::type arrow_type_id, starrocks::LogicalType sr_logical_type) {
return (arrow_type_id << 16) | (sr_logical_type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some assert/DCHECK to check overflow

Comment on lines 215 to 211
std::shared_ptr<starrocks::Column> ColumnConverter::get_data_column(
const std::shared_ptr<starrocks::Column> &column) {
if (column->is_nullable()) {
auto *nullable_column = down_cast<const NullableColumn *>(column.get());
return nullable_column->data_column();
} else if (column->is_constant()) {
auto *const_column = down_cast<const ConstColumn *>(column.get());
return const_column->data_column();
} else {
return column;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a static method?

ARROW_ASSIGN_OR_RAISE(
null_bitmap, arrow::internal::BytesToBits(
reinterpret_cast<const std::vector<uint8_t> &>(null_bytes),
const_cast<arrow::MemoryPool *>(_pool))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why make _pool const, then use const_cast?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make _pool as const is mainly to prevent it from being changed in the column converter, but the param pool received by the BytesToBits method is a non-const type, so a const_cast conversion is done as a last resort.

Comment on lines 44 to 52
virtual arrow::Status open() = 0;

virtual arrow::Status write(const struct ArrowArray *c_arrow_array) = 0;

virtual arrow::Status flush() = 0;

virtual arrow::Status finish() = 0;

virtual void close() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment document about how this class should be used, open/write/flush/finish order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, good suggention

return JNI_ERR;
}

kNativeOptExceptionClass = find_class(env, "Lcom/starrocks/format/jni/NativeOperateException;");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought FindClass does not need Lxxx; in "Lcom/starrocks/format/jni/NativeOperateException;" , should you double confirm this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the Oracle JNI docs, and the Lxxx here is not needed. I have removed it.

@plotor plotor force-pushed the zhenchao-bypass-format-writer-20241107 branch 4 times, most recently from d1b9dec to 9e6aa3a Compare December 16, 2024 11:43
… Server.

Signed-off-by: plotor <zhenchao.wang@hotmail.com>
@plotor plotor force-pushed the zhenchao-bypass-format-writer-20241107 branch from 9e6aa3a to 388ef6d Compare December 16, 2024 12:57
Copy link

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

Copy link

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

Copy link

[BE Incremental Coverage Report]

pass : 14 / 15 (93.33%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/column/field.h 2 3 66.67% [113]
🔵 be/src/types/timestamp_value.cpp 5 5 100.00% []
🔵 be/src/types/date_value.cpp 5 5 100.00% []
🔵 be/src/exec/olap_scan_prepare.cpp 2 2 100.00% []

@decster decster merged commit ed87f90 into StarRocks:main Dec 18, 2024
51 of 52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Supports reading and writing data files bypass BE Server
3 participants