GH-43911: [C++] Compute Row: ListKeyEncoder Supports #43912

Open
wants to merge 7 commits into base: main

mapleFU
Member

@mapleFU mapleFU commented Sep 2, 2024

Rationale for this change

Add ListKeyEncoder support to RowEncoder.

What changes are included in this PR?

  1. Add ListKeyEncoder support to RowEncoder
  2. Change RowEncoder::Init to return Status

Are these changes tested?

Yes

Are there any user-facing changes?

Not currently; these are internal interfaces.

github-actions bot commented Sep 2, 2024

⚠️ GitHub issue #43911 has been automatically assigned in GitHub to PR creator.

@@ -29,6 +29,51 @@ using internal::FirstTimeBitmapWriter;
namespace compute {
namespace internal {

Result<std::shared_ptr<KeyEncoder>> MakeKeyEncoder(const TypeHolder& column_type, std::shared_ptr<ExtensionType>* extension_type, MemoryPool* pool) {
Member Author

@mapleFU mapleFU Sep 2, 2024

This could also return a unique_ptr. I don't see why a shared_ptr is used here.

Also, this function was extracted from RowEncoder.
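
To illustrate the ownership question raised above, here is a minimal sketch of a factory that returns `std::unique_ptr`. The `Encoder` hierarchy, `MakeEncoder`, and `MakeSharedEncoder` are hypothetical stand-ins and not Arrow's classes; the point is only that a caller who needs shared ownership can still convert afterwards.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration; these are not Arrow's classes.
struct Encoder {
  virtual ~Encoder() = default;
  virtual std::string name() const = 0;
};

struct BooleanEncoder : Encoder {
  std::string name() const override { return "boolean"; }
};

struct FixedWidthEncoder : Encoder {
  std::string name() const override { return "fixed-width"; }
};

// The factory hands sole ownership to the caller.
// Returns nullptr for kinds it does not know.
std::unique_ptr<Encoder> MakeEncoder(const std::string& kind) {
  if (kind == "bool") return std::make_unique<BooleanEncoder>();
  if (kind == "fixed") return std::make_unique<FixedWidthEncoder>();
  return nullptr;
}

// A caller that needs shared ownership can opt in at the call site.
std::shared_ptr<Encoder> MakeSharedEncoder(const std::string& kind) {
  return MakeEncoder(kind);  // implicit unique_ptr -> shared_ptr conversion
}
```

The conversion only goes one way: a `shared_ptr` cannot be turned back into a `unique_ptr`, which is one reason factories often prefer the narrower return type.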

@github-actions github-actions bot added the "awaiting review" and "awaiting committer review" labels and removed the "awaiting review" label Sep 2, 2024
@mapleFU mapleFU marked this pull request as ready for review September 8, 2024 10:51
@@ -269,6 +272,190 @@ struct ARROW_EXPORT NullKeyEncoder : KeyEncoder {
}
};

template <typename ListType>
struct ARROW_EXPORT ListKeyEncoder : KeyEncoder {
Member Author

I wonder whether I should move this into the .cc file, since it requires a lot.

Member

Yes, please do. It would be nice to hide most contents from this file into the corresponding .cc

Comment on lines +292 to +293
// AddLength for each list
std::vector<int32_t> child_lengthes;
Member Author

This is for AddLength: when many values are involved, we call AddLength(child_lengthes.data(), length) once rather than calling it with length 1 each time.
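
As a rough sketch of the batching idea in the comment above, the following computes per-element sizes with a single call over a reusable scratch vector. `FixedWidthSizer` and `ListPayloadSize` are hypothetical; the parameter order and the accumulate-into-slots behaviour are assumptions made for illustration, not the actual KeyEncoder interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in: adds each element's encoded size to lengths[i].
struct FixedWidthSizer {
  int32_t byte_width;
  int calls = 0;  // counts invocations, for illustration only

  void AddLength(int64_t count, int32_t* lengths) {
    ++calls;
    for (int64_t i = 0; i < count; ++i) lengths[i] += byte_width;
  }
};

// Returns the total encoded size of one list's elements using a single
// batched AddLength call. `scratch` is reused across lists to avoid
// reallocating for every row.
int64_t ListPayloadSize(FixedWidthSizer& sizer, int64_t element_count,
                        std::vector<int32_t>& scratch) {
  scratch.assign(static_cast<std::size_t>(element_count), int32_t{0});
  sizer.AddLength(element_count, scratch.data());
  int64_t total = 0;
  for (int32_t len : scratch) total += len;
  return total;
}
```

One call per list keeps the virtual-call overhead proportional to the number of lists rather than the number of elements.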

if (list_scalar.is_valid && list_scalar.value->length() > 0) {
auto element_count = static_cast<int32_t>(list_scalar.value->length());
// Counting the size of the encoded list
std::vector<int32_t> child_lengthes(element_count, 0);
Member Author

ditto

Comment on lines +363 to +364
RETURN_NOT_OK(
this->element_encoder_->Encode(ExecValue{tmp_child_data}, 1, &encoded_ptr));
Member Author

This part is a bit tricky: Encode has no interface for "encode one element", so this calls Encode with a length of 1.

ARROW_ASSIGN_OR_RAISE(auto element_array, ::arrow::Concatenate(child_datas, pool));
element_data = element_array->data();
} else {
// If there are no elements, we need to create an empty array
Member Author

This requires an "empty" ArrayData

Member Author

mapleFU commented Sep 8, 2024

@pitrou @zanmato1984 @felipecrv I've written a basic implementation. The performance might be poor, but it implements the basic logic. Would you mind taking a look?

(I'll be on vacation from 9.14 to 9.21, so my responses may be delayed.)

auto raw_offsets = offset_buf->mutable_span_as<Offset>();
Offset element_sum = 0;
raw_offsets[0] = 0;
std::vector<std::shared_ptr<Array>> child_datas;
Member Author

This is a bit tricky: it always decodes one element at a time from the child, so we end up with as many Arrays as there are elements here...

Member Author

mapleFU commented Sep 8, 2024

I think something like a callback would help:

struct ARROW_EXPORT KeyEncoder {
  // Proposed: decode by appending into a caller-provided builder.
  virtual Result<std::shared_ptr<ArrayData>> DecodeWithBuilder(ArrayBuilder* builder) = 0;
};

I haven't found an efficient interface for the encoder; I may look through other code for ideas.

Member

@pitrou pitrou left a comment

Incomplete review for now

@@ -514,7 +514,7 @@ std::vector<std::shared_ptr<Array>> GenRandomUniqueRecords(
val_types.push_back(result[i]->type());
}
RowEncoder encoder;
- encoder.Init(val_types, ctx);
+ auto s = encoder.Init(val_types, ctx);
Member

Please at least use DCHECK_OK.
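
The concern above is that a returned status should not be silently dropped. The sketch below shows the general pattern with a hypothetical `MiniStatus` type, a hypothetical `MINI_CHECK_OK` macro, and a hypothetical `InitEncoder` function; none of these are Arrow APIs, and unlike a debug-only check this stand-in always runs. It assumes C++17 for `[[nodiscard]]` on a class.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <utility>

// Hypothetical minimal status type; arrow::Status is considerably richer.
class [[nodiscard]] MiniStatus {
 public:
  static MiniStatus OK() { return MiniStatus(true, ""); }
  static MiniStatus Invalid(std::string msg) {
    return MiniStatus(false, std::move(msg));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  MiniStatus(bool ok, std::string msg) : ok_(ok), message_(std::move(msg)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical macro: evaluate the expression once and abort on failure.
#define MINI_CHECK_OK(expr)                                \
  do {                                                     \
    const MiniStatus mini_st = (expr);                     \
    if (!mini_st.ok()) {                                   \
      std::fprintf(stderr, "check failed: %s\n",           \
                   mini_st.message().c_str());             \
      std::abort();                                        \
    }                                                      \
  } while (false)

// Hypothetical fallible initializer standing in for an Init that
// returns a status instead of void.
MiniStatus InitEncoder(int column_count) {
  if (column_count <= 0) return MiniStatus::Invalid("no columns");
  return MiniStatus::OK();
}
```

In test or benchmark helpers, wrapping the call as `MINI_CHECK_OK(InitEncoder(n));` makes a failure visible immediately instead of leaving an unused local variable behind.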

@@ -18,10 +18,13 @@
#pragma once

#include <cstdint>
#include <iostream>
Member

Why?

Member Author

For debugging; I will remove it.

@@ -269,6 +272,190 @@ struct ARROW_EXPORT NullKeyEncoder : KeyEncoder {
}
};

template <typename ListType>
struct ARROW_EXPORT ListKeyEncoder : KeyEncoder {
Member

Can you add a comment explaining what the encoding looks like?

Member

Ah, I see you added a comment below.
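
For readers without the diff at hand, here is a hedged sketch of one plausible row layout, inferred only from the null-handling snippet quoted later in this review (a marker byte followed by an Offset-sized count). The marker values, the `Offset` alias, the raw `int32_t` element encoding, and the `EncodeList` function are all assumptions for illustration and may differ from the PR. It requires C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Assumed marker values, chosen for illustration only.
constexpr uint8_t kValid = 0;
constexpr uint8_t kNull = 1;

using Offset = int32_t;  // assumed offset type of the list

// Encodes one optional list of int32 values as:
//   [marker byte][element count as Offset][raw element bytes...]
// A null list is written as the null marker followed by a zero count.
std::vector<uint8_t> EncodeList(
    const std::optional<std::vector<int32_t>>& list) {
  const Offset count = list ? static_cast<Offset>(list->size()) : 0;
  const std::size_t payload = sizeof(int32_t) * static_cast<std::size_t>(count);
  std::vector<uint8_t> out(1 + sizeof(Offset) + payload);
  uint8_t* p = out.data();
  *p++ = list ? kValid : kNull;
  std::memcpy(p, &count, sizeof(Offset));
  p += sizeof(Offset);
  if (payload > 0) std::memcpy(p, list->data(), payload);
  return out;
}
```

Because the count precedes the elements, a decoder can tell an empty list (valid marker, count 0) apart from a null list (null marker, count 0).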

VisitBitBlocksVoid(
validity, data.array.offset, data.array.length,
[&](int64_t i) {
ARROW_UNUSED(i);
Member

It's used.

validity, data.array.offset, data.array.length,
[&](int64_t i) {
ARROW_UNUSED(i);
child_lengthes.clear();
Member

No need to call clear as you're calling resize below.
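
A small aside on the standard-library behaviour involved: `resize` on its own keeps whatever values already sit in the retained slots, while `assign(n, 0)` sets both the size and the contents in one call. Whether stale values matter here depends on whether the following code overwrites every slot or accumulates into it, which this sketch does not assume either way. `ResetScratch` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Resets `scratch` to exactly `n` zeroed slots in a single call,
// reusing the existing allocation when capacity allows.
void ResetScratch(std::vector<int32_t>& scratch, std::size_t n) {
  scratch.assign(n, int32_t{0});
}
```

For a vector holding `{5, 5, 5}`, `resize(2)` leaves `{5, 5}`, whereas `ResetScratch(v, 2)` leaves `{0, 0}`.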

const uint8_t* validity = data.array.buffers[0].data;
const auto* offsets = data.array.GetValues<Offset>(1);
// AddLength for each list
std::vector<int32_t> child_lengthes;
Member

Suggested change
- std::vector<int32_t> child_lengthes;
+ std::vector<int32_t> child_lengths;

Comment on lines +360 to +365
for (int64_t i = 0; i < child_array.length; i++) {
ArraySpan tmp_child_data(child_array);
tmp_child_data.SetSlice(child_array.offset + i, 1);
RETURN_NOT_OK(
this->element_encoder_->Encode(ExecValue{tmp_child_data}, 1, &encoded_ptr));
}
Member

Why do it one element at a time? Why not instead:

Suggested change
- for (int64_t i = 0; i < child_array.length; i++) {
- ArraySpan tmp_child_data(child_array);
- tmp_child_data.SetSlice(child_array.offset + i, 1);
- RETURN_NOT_OK(
- this->element_encoder_->Encode(ExecValue{tmp_child_data}, 1, &encoded_ptr));
- }
+ RETURN_NOT_OK(
+ this->element_encoder_->Encode(ExecValue{child_array}, child_array.length, &encoded_ptr));

RETURN_NOT_OK(VisitBitBlocks(
validity, data.array.offset, data.array.length,
[&](int64_t i) {
ARROW_UNUSED(i);
Member

It's used.

Comment on lines +389 to +395
for (int64_t i = 0; i < batch_length; i++) {
RETURN_NOT_OK(handle_valid_value(span));
}
} else {
for (int64_t i = 0; i < batch_length; i++) {
handle_null_value();
}
Member

You could instead call handle_valid_value or handle_null_value once and then memcpy the result batch_length - 1 times.
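
The suggestion above can be sketched as follows: run the expensive encoding step once for the first row, then copy those bytes into the remaining rows. `EncodeScalarBatch` and the row representation (one `std::vector<uint8_t>` per row) are hypothetical simplifications; the real encoder writes through per-row pointers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Encodes a repeated (scalar) value into every row. `encode_one` writes
// exactly `row_size` bytes to the pointer it receives and is invoked once;
// the other rows are filled with memcpy.
template <typename EncodeOne>
void EncodeScalarBatch(std::vector<std::vector<uint8_t>>& rows,
                       std::size_t row_size, EncodeOne&& encode_one) {
  if (rows.empty()) return;
  rows[0].resize(row_size);
  encode_one(rows[0].data());  // the expensive step runs a single time
  for (std::size_t i = 1; i < rows.size(); ++i) {
    rows[i].resize(row_size);
    if (row_size > 0) {
      std::memcpy(rows[i].data(), rows[0].data(), row_size);
    }
  }
}
```

This trades `batch_length` encoder invocations for one invocation plus `batch_length - 1` flat copies, which should pay off when encoding a list is much costlier than copying its bytes.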

Collaborator

@zanmato1984 zanmato1984 left a comment

A small simplification.

auto& encoded_ptr = *encoded_bytes++;
*encoded_ptr++ = kNullByte;
util::SafeStore(encoded_ptr, static_cast<Offset>(0));
encoded_ptr += sizeof(Offset);
Collaborator

Suggested change
- encoded_ptr += sizeof(Offset);
+ encoded_ptr += sizeof(Offset);
+ return Status::OK();

Comment on lines +381 to +384
[&]() {
handle_null_value();
return Status::OK();
}));
Collaborator

Suggested change
- [&]() {
- handle_null_value();
- return Status::OK();
- }));
+ handle_null_value));
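
The two suggestions above work together: once the null handler itself returns a status, it can be passed to the visitor directly and the wrapping lambda disappears. The sketch below shows the pattern with hypothetical names (`VisitSlots`, `CountNulls`) and uses `bool` in place of a status type to stay self-contained.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical visitor: calls on_valid(i) for set slots and on_null() for
// unset ones, stopping at the first handler that reports failure.
template <typename ValidFn, typename NullFn>
bool VisitSlots(const std::vector<bool>& validity, ValidFn&& on_valid,
                NullFn&& on_null) {
  for (std::size_t i = 0; i < validity.size(); ++i) {
    const bool ok = validity[i] ? on_valid(i) : on_null();
    if (!ok) return false;
  }
  return true;
}

// Returns the number of null slots, or -1 if a handler failed.
int CountNulls(const std::vector<bool>& validity) {
  int nulls = 0;
  // The handler returns its own success value, so it needs no wrapper.
  auto handle_null_value = [&]() {
    ++nulls;
    return true;
  };
  auto handle_valid_value = [](std::size_t) { return true; };
  const bool ok = VisitSlots(validity, handle_valid_value, handle_null_value);
  return ok ? nulls : -1;
}
```

Passing the named callable keeps the call site short and makes it obvious that both handlers share one return convention.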
