-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize count(*) with table statistics #620
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- thank you @Dandandan.
I have filed https://github.com/influxdata/influxdb_iox/issues/1815 to get IOx using this upstream. Great stuff.
@@ -108,6 +108,11 @@ pub trait TableProvider: Sync + Send { | |||
/// Statistics should be optional because not all data sources can provide statistics. | |||
fn statistics(&self) -> Statistics; | |||
|
|||
/// Returns whether statistics provided are exact values or estimates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nice thing about adding a has_exact_statistics
is that it is a backwards compatible API
An alternate might be to encapsulate the "Exact statistics or not" into a field on Statistics
itself, which feels to me like it keeps related things together more, but has the downside of changing Statistics / APIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a similar thought process I had.
Maybe at some point it would also be nice to tell what parts of the statistics are exact (e.g. number of rows) and what estimated (such as distinct count).
.unwrap(); | ||
let expected = "\ | ||
Projection: #COUNT(UInt8(1))\ | ||
\n Projection: UInt64(100) AS COUNT(Uint8(1))\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
I think we need to check for empty gby as well
After this PR I also plan some follow up issues:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff 👍 I recommend using logical plan builder to construct the new plans to make things more robust.
.unwrap(); | ||
|
||
let plan = ctx | ||
.create_logical_plan("select sum(a)/count(*) from test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a cool optimization 👍
Perhaps we can do this as a follow on PR |
Which issue does this PR close?
Closes #618
Rationale for this change
Speeding up queries (or part of a query) that can utilize the availability of the number of rows in the table statistics. Often used to inspect tables.
Not having to scan the data can save considerable time / cost.
On the TPC-H lineitem table we can see a 80x speed up:
Master:
What changes are included in this PR?
Are there any user-facing changes?