Add logic to serialize/deserialize SetAccumulators #8660

Open
wants to merge 1 commit into
base: main

Conversation

aditi-pandit
Collaborator

@aditi-pandit aditi-pandit commented Feb 2, 2024

This is the second in a set of PRs to add support for spilling distinct aggregations (see full version in #7791).

The logic to serialize/deserialize SetAccumulators is used by DistinctAggregations for spilling.


netlify bot commented Feb 2, 2024

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 9aa2cef
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/6711400a1b3fe40008123db9

@aditi-pandit aditi-pandit marked this pull request as draft February 2, 2024 19:51
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 2, 2024
@aditi-pandit aditi-pandit marked this pull request as ready for review February 3, 2024 00:18
@aditi-pandit aditi-pandit force-pushed the serialize_set_accumulators branch 2 times, most recently from 0cb45bc to 864d8c9 Compare February 3, 2024 06:04
Contributor

@mbasmanova mbasmanova left a comment

@aditi-pandit I read through the code for primitive types and strings. Some comments below.

velox/exec/SetAccumulator.h (10 resolved review threads, all outdated)
@aditi-pandit aditi-pandit force-pushed the serialize_set_accumulators branch 3 times, most recently from 08a0b29 to e2c8b17 Compare February 8, 2024 03:41
@aditi-pandit
Collaborator Author

@mbasmanova: I have rebased and updated the code to address the comments. PTAL.

Contributor

@mbasmanova mbasmanova left a comment

Some comments.

velox/exec/SetAccumulator.h (2 resolved review threads, both outdated)
auto* rawBuffer = flatResult->getRawStringBufferWithSpace(totalBytes, true);

auto nullIndexValue = nullIndexSerializationValue();
memcpy(rawBuffer, &nullIndexValue, kSizeOfVector);
Contributor

Can we use common::OutputByteStream instead of plain memcpy?

We can add seekp(position) method to it if needed.

Collaborator Author

I feel seekp(position) breaks the stream abstraction, since streams are usually contiguous. Don't you think so?
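
For readers following this thread, here is a minimal sketch of what a seekable writer over a pre-sized buffer could look like, and why seeking comes up at all: a header such as a count is reserved first and patched once the payload has been written. `SeekableByteWriter` and `writeWithBackpatchedCount` are hypothetical names invented for this illustration. This is not the actual `common::OutputByteStream` interface, and the layout is not the one used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical bounds-checked writer over a caller-owned buffer.
// Illustration only; not a Velox class.
class SeekableByteWriter {
 public:
  SeekableByteWriter(char* data, size_t capacity)
      : data_(data), capacity_(capacity) {}

  // Copies the raw bytes of 'value' at the current position and advances it.
  template <typename T>
  void append(const T& value) {
    static_assert(std::is_trivially_copyable_v<T>);
    if (sizeof(T) > capacity_ - offset_) {
      throw std::out_of_range("write past end of buffer");
    }
    std::memcpy(data_ + offset_, &value, sizeof(T));
    offset_ += sizeof(T);
  }

  // Moves the write position. This is the operation debated above.
  void seekp(size_t position) {
    if (position > capacity_) {
      throw std::out_of_range("seekp past end of buffer");
    }
    offset_ = position;
  }

  size_t tellp() const {
    return offset_;
  }

 private:
  char* data_;
  size_t capacity_;
  size_t offset_{0};
};

// Writes [int32 count][int64 values...]. The count is reserved first and
// patched after the payload, which is where seekp is needed.
inline size_t writeWithBackpatchedCount(
    SeekableByteWriter& out,
    const std::vector<int64_t>& values) {
  const size_t headerPosition = out.tellp();
  out.append<int32_t>(0); // Placeholder for the count.
  int32_t count = 0;
  for (int64_t value : values) {
    out.append(value);
    ++count;
  }
  const size_t end = out.tellp();
  out.seekp(headerPosition);
  out.append(count);
  out.seekp(end); // Restore the position so later writes stay contiguous.
  return end;
}
```

When the header value is known up front, the same layout can be written strictly front to back, which keeps the stream contiguous and avoids the need for seekp.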

void serialize(const VectorPtr& result, vector_size_t index) {
auto* flatResult = result->as<FlatVector<StringView>>();
auto* rawBuffer =
flatResult->getRawStringBufferWithSpace(stringSetBytes, true);
Contributor

This may not work well when there are millions of values in the accumulator. Perhaps we could put some limits in place to fail cleanly instead of crashing.

Spilling logic needs to calculate spill size for each row before deciding how many rows to spill at once, no? We may need to expose APIs to compute these sizes.
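
As a rough illustration of the two suggestions above, the sketch below computes the serialized size of one accumulator before any buffer is allocated and rejects oversized rows with an exception. All names (`serializedSize`, `checkedSerializedSize`, `kMaxSerializedBytes`) and the length-prefixed layout are assumptions made for this example. A `std::unordered_set<std::string>` stands in for the string accumulator, and no real Velox API is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Hypothetical cap on the serialized size of a single accumulator.
constexpr size_t kMaxSerializedBytes = 1 << 20;

// Assumed layout: [int32 count] followed by [int32 length][bytes] per value.
inline size_t serializedSize(const std::unordered_set<std::string>& values) {
  size_t total = sizeof(int32_t); // Element count header.
  for (const auto& value : values) {
    total += sizeof(int32_t) + value.size();
  }
  return total;
}

// Returns the size, or throws instead of attempting an oversized allocation.
inline size_t checkedSerializedSize(
    const std::unordered_set<std::string>& values,
    size_t limit = kMaxSerializedBytes) {
  const size_t size = serializedSize(values);
  if (size > limit) {
    throw std::length_error("serialized accumulator exceeds the size limit");
  }
  return size;
}
```

A spiller could call a function of this kind per row to decide how many rows fit into one spill batch.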

Collaborator Author

After some thinking it seems better to serialize into an ArrayVector rather than a single serialized string. I'm trying that approach.
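
To make the array-based idea concrete, here is a minimal round-trip sketch in which each set value becomes its own binary element instead of being packed into one large buffer. `BinaryArray` is a hypothetical stand-in for a single ARRAY(VARBINARY) row, and `serializeAsArray` / `deserializeFromArray` are illustrative names. The code in the PR operates on Velox vectors and may differ substantially.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for one ARRAY(VARBINARY) row: one binary blob per set element.
using BinaryArray = std::vector<std::string>;

// Emits one fixed-width element per value, so no single allocation has to
// hold the whole serialized set.
inline BinaryArray serializeAsArray(const std::set<int64_t>& values) {
  BinaryArray elements;
  elements.reserve(values.size());
  for (int64_t value : values) {
    elements.emplace_back(
        reinterpret_cast<const char*>(&value), sizeof(value));
  }
  return elements;
}

// Rebuilds the set, rejecting elements of unexpected width.
inline std::set<int64_t> deserializeFromArray(const BinaryArray& elements) {
  std::set<int64_t> values;
  for (const auto& element : elements) {
    if (element.size() != sizeof(int64_t)) {
      throw std::invalid_argument("unexpected element width");
    }
    int64_t value;
    std::memcpy(&value, element.data(), sizeof(value));
    values.insert(value);
  }
  return values;
}
```

With this shape, the size of a row is the sum of its element sizes, so it can be computed without building one contiguous string first.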

velox/exec/SetAccumulator.h (5 resolved review threads, 4 of them outdated)
auto* flatResult = result->as<FlatVector<StringView>>();
auto* rawBuffer = flatResult->getRawStringBufferWithSpace(totalSize, true);

SerializationStream stream(rawBuffer, totalSize);
Contributor

Can we reuse common::OutputByteStream here and add necessary APIs to it?

Collaborator Author

OutputByteStream lives in velox/common/base/IOUtils.h, which is a much lower-level library. I feel that making it depend on AddressableNonNullValueList, which sits at the velox/exec layer, would make it less reusable.

The current code is a reasonable compromise.

WDYT?

@aditi-pandit aditi-pandit force-pushed the serialize_set_accumulators branch 2 times, most recently from 804d81f to 0eda534 Compare February 29, 2024 07:28
@aditi-pandit
Collaborator Author

@mbasmanova: I have updated the code to serialize to an ARRAY(VARBINARY) instead of a single string buffer. PTAL.

@aditi-pandit aditi-pandit force-pushed the serialize_set_accumulators branch 2 times, most recently from 80883f7 to 6ea1c43 Compare March 1, 2024 21:50
@aditi-pandit
Collaborator Author

@xiaoxmeng: Meng, I would appreciate a round of review. Thanks!

@aditi-pandit aditi-pandit force-pushed the serialize_set_accumulators branch 2 times, most recently from e9be986 to e480a2e Compare March 21, 2024 01:20

stale bot commented Jun 19, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Jun 19, 2024
@aditi-pandit aditi-pandit force-pushed the serialize_set_accumulators branch 2 times, most recently from 32ed1be to 173b5d3 Compare June 28, 2024 23:37
@stale stale bot removed the stale label Jun 28, 2024
@aditi-pandit aditi-pandit requested a review from Yuhta June 29, 2024 00:10
@aditi-pandit
Collaborator Author

@xiaoxmeng: Meng, ping for review. Thanks.


stale bot commented Oct 17, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Oct 17, 2024