Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Improve performance of thetasketch dinstinct #1102

Merged

Conversation

Rachelint
Copy link
Contributor

@Rachelint Rachelint commented Jul 24, 2023

Rationale

Our impl for thetasketch_dinstinct udf is so rough, that leading to the bad performance.
The aggregate stage cost 17s in our production...

In this pr, I refactor the impl thetasketch_dinstinct for the better performance, now the aggregate stage will just cost 6s for the same sample sql.

Detailed Changes

  • Update hyperloglog crate for better performance.
  • Use DatumView instead of DfScalarValue in HllDistinct to eliminate the unnecessary clone.
  • Fix related test.

Test Plan

Test by new ut and integration.

common_types/src/datum.rs Outdated Show resolved Hide resolved
common_types/src/datum.rs Outdated Show resolved Hide resolved
server/src/http.rs Show resolved Hide resolved
df_operator/src/aggregate.rs Outdated Show resolved Hide resolved
Copy link
Member

@ShiKaiWi ShiKaiWi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ShiKaiWi ShiKaiWi merged commit da7a799 into apache:main Jul 25, 2023
@ShiKaiWi ShiKaiWi mentioned this pull request Jul 18, 2023
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants