-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
statistics: support building CMSketch with Top N #10163
Conversation
Codecov Report
@@ Coverage Diff @@
## master #10163 +/- ##
===============================================
- Coverage 77.9597% 77.914% -0.0457%
===============================================
Files 408 408
Lines 83656 83578 -78
===============================================
- Hits 65218 65119 -99
- Misses 13599 13619 +20
- Partials 4839 4840 +1 |
statistics/cmsketch.go
Outdated
// elements in data should not be modified after this call. | ||
// Not exactly n elements, will add a few elements, the number of which is close to the n-th most element. | ||
func (c *CMSketch) BuildTopN(data [][]byte, n, topnThreshold uint32, total uint64) { | ||
sampleSize := uint64(len(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put it to where it is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think most of the code is no problem.
Our next step is to optimize the comments. 😄
statistics/cmsketch.go
Outdated
return min | ||
res = min | ||
} | ||
// If res is small than some value, we think it is sampled occasionally |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If res is small than some value, we think it is sampled occasionally | |
// If res is smaller than some value, we think it is sampled occasionally |
statistics/cmsketch_test.go
Outdated
@@ -34,6 +34,18 @@ func (c *CMSketch) insert(val *types.Datum) error { | |||
return nil | |||
} | |||
|
|||
func prepareCMSWithTopN(d, w int32, val []*types.Datum, n uint32, total uint64) (*CMSketch, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func prepareCMSWithTopN(d, w int32, val []*types.Datum, n uint32, total uint64) (*CMSketch, error) { | |
func prepareCMSWithTopN(d, w int32, vals []*types.Datum, n uint32, total uint64) (*CMSketch, error) { |
// This case, we should also update c.defaultValue | ||
// Set default value directly will result in more error, instead, update it by 5%. | ||
// This should make estimate better, if defaultValue becomes 0 frequently, commit this line. | ||
c.defaultValue = uint64(float64(c.defaultValue)*0.95 + float64(c.defaultValue)*0.05) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need more comments about this line? At least I can't understand why should calculate like this.
statistics/cmsketch.go
Outdated
sumTopN uint64 | ||
sampleNDV = uint32(len(sorted)) | ||
) | ||
numTop = mathutil.MinUint32(sampleNDV, numTop) // In case numTop is bigger than sample NUV. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numTop = mathutil.MinUint32(sampleNDV, numTop) // In case numTop is bigger than sample NUV. | |
numTop = mathutil.MinUint32(sampleNDV, numTop) // Ensure numTop no larger than sampNDV. |
statistics/cmsketch.go
Outdated
if i >= numTop && sorted[i]*3 < sorted[numTop-1]*2 && last != sorted[i] { | ||
break | ||
} | ||
// These two values are only used for build topN, and they are not used in counting defaultValue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant comment?
statistics/cmsketch.go
Outdated
// The final Top-N index may have at most 2*numTop elements for some less skewed sample. | ||
for i := uint32(0); i < sampleNDV && i < numTop*2; i++ { | ||
// Here, 2/3 is get by running tests, tested 1, 1/2, 2/3, and 2/3 is relative better than 1 and 1/2. | ||
// If the frequency of i-th elements is close to n-th element, it is added to the top N index too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about moving this comment on the top of the for
loop and replace it with:
// Only element whose frequency is not smaller than 2/3 multiples the
// frequency of the n-th element are added to the TopN statistics. We chose
// 2/3 as an empirical value because the average cardinality estimation
// error is relatively small compared with 1, 1/2 and 1/3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please illustrate more on the defaultValue
logic? I don't understand the rationales behind it.
|
||
// calculateEstimateNDV calculates the estimate ndv of a sampled data from a multisize with size total. | ||
// count[i] stores the count of the i-th element. | ||
// onlyOnceItems is the number of elements that occurred only once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are obsolete?
} | ||
|
||
func buildCMSWithTopN(helper *topNHelper, d, w int32, scaleRatio uint64) (c *CMSketch) { | ||
c, helper.sumTopN, helper.numTop = NewCMSketch(d, w), 0, 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we set helper.sumTopN
and helper.numTop
to 0 at this line, enableTopN
would always be false? After we record these 2 values in newTopNHelper
, we never use them and then set them to 0? This looks unreasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they should be put after enableTopN
|
||
// queryAddTopN TopN adds count to CMSketch.topN if exists, and returns the count of such elements after insert. | ||
// If such elements does not in topn elements, nothing will happen and false will be returned. | ||
func (c *CMSketch) updateTopNWithDelta(h1, h2 uint64, d []byte, delta uint64) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments are obsolete?
c.defaultValue = 1 | ||
} else if estimateNDV <= uint64(helper.numTop) { | ||
c.defaultValue = 1 | ||
} else if estimateNDV+helper.onlyOnceItems <= uint64(sampleNDV) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale behind this condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conditon is to avoid estimateNDV - sampleNDV + helper.onlyOnceItems < 0
However, this will never happen and should be deleted.
c.defaultValue = 1 | ||
} else { | ||
estimateRemainingCount := rowCount - (helper.sampleSize-uint64(helper.onlyOnceItems))*scaleRatio | ||
c.defaultValue = estimateRemainingCount / (estimateNDV - uint64(sampleNDV) + helper.onlyOnceItems) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale behind this formula?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This formula is to calculate the estimate count of not sampled data.
However, elements occurred only once in sampled data might be sampled just by chance.
} | ||
} | ||
|
||
func (c *CMSketch) considerDefVal(cnt uint64) bool { | ||
return cnt < 2*(c.count/uint64(c.width)) && c.defaultValue > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale behind this formula?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like line 154, scaled count might be too big for some element, we should consider return default value.
What problem does this PR solve?
#10042 has about 600 lines, this pr picks out the core logic except that related to protobuf and storage.
What is changed and how it works?
Check List
Tests
Related changes
#10042