Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

statistics: support building CMSketch with Top N #10163

Merged
merged 41 commits into from
Apr 24, 2019

Conversation

erjiaqing
Copy link
Contributor

What problem does this PR solve?

#10042 has about 600 lines, this pr picks out the core logic except that related to protobuf and storage.

What is changed and how it works?

Check List

Tests

  • Unit test

Related changes

#10042

@erjiaqing erjiaqing added type/enhancement The issue or PR belongs to an enhancement. component/statistics labels Apr 16, 2019
@codecov
Copy link

codecov bot commented Apr 17, 2019

Codecov Report

Merging #10163 into master will decrease coverage by 0.0456%.
The diff coverage is 78.7878%.

@@               Coverage Diff               @@
##             master    #10163        +/-   ##
===============================================
- Coverage   77.9597%   77.914%   -0.0457%     
===============================================
  Files           408       408                
  Lines         83656     83578        -78     
===============================================
- Hits          65218     65119        -99     
- Misses        13599     13619        +20     
- Partials       4839      4840         +1

@alivxxx alivxxx requested review from winoros, zz-jason and qw4990 April 17, 2019 03:23
// elements in data should not be modified after this call.
// Not exactly n elements, will add a few elements, the number of which is close to the n-th most element.
func (c *CMSketch) BuildTopN(data [][]byte, n, topnThreshold uint32, total uint64) {
sampleSize := uint64(len(data))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it to where it is needed.

@alivxxx alivxxx requested a review from lzmhhh123 April 17, 2019 05:47
Copy link
Contributor

@XuHuaiyu XuHuaiyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most of the code is no problem.

Our next step is to optimize the comments. 😄

@XuHuaiyu XuHuaiyu requested review from zz-jason and alivxxx April 24, 2019 02:25
@XuHuaiyu XuHuaiyu changed the title statistics: Support CMSketch with Top N (build only) statistics: support building CMSketch with Top N Apr 24, 2019
return min
res = min
}
// If res is small than some value, we think it is sampled occasionally
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If res is small than some value, we think it is sampled occasionally
// If res is smaller than some value, we think it is sampled occasionally

@@ -34,6 +34,18 @@ func (c *CMSketch) insert(val *types.Datum) error {
return nil
}

func prepareCMSWithTopN(d, w int32, val []*types.Datum, n uint32, total uint64) (*CMSketch, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func prepareCMSWithTopN(d, w int32, val []*types.Datum, n uint32, total uint64) (*CMSketch, error) {
func prepareCMSWithTopN(d, w int32, vals []*types.Datum, n uint32, total uint64) (*CMSketch, error) {

// This case, we should also update c.defaultValue
// Set default value directly will result in more error, instead, update it by 5%.
// This should make estimate better, if defaultValue becomes 0 frequently, commit this line.
c.defaultValue = uint64(float64(c.defaultValue)*0.95 + float64(c.defaultValue)*0.05)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need more comments about this line? At least I can't understand why should calculate like this.

sumTopN uint64
sampleNDV = uint32(len(sorted))
)
numTop = mathutil.MinUint32(sampleNDV, numTop) // In case numTop is bigger than sample NUV.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
numTop = mathutil.MinUint32(sampleNDV, numTop) // In case numTop is bigger than sample NUV.
numTop = mathutil.MinUint32(sampleNDV, numTop) // Ensure numTop no larger than sampNDV.

if i >= numTop && sorted[i]*3 < sorted[numTop-1]*2 && last != sorted[i] {
break
}
// These two values are only used for build topN, and they are not used in counting defaultValue.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant comment?

// The final Top-N index may have at most 2*numTop elements for some less skewed sample.
for i := uint32(0); i < sampleNDV && i < numTop*2; i++ {
// Here, 2/3 is get by running tests, tested 1, 1/2, 2/3, and 2/3 is relative better than 1 and 1/2.
// If the frequency of i-th elements is close to n-th element, it is added to the top N index too.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about moving this comment on the top of the for loop and replace it with:

    // Only element whose frequency is not smaller than 2/3 multiples the
    // frequency of the n-th element are added to the TopN statistics. We chose
    // 2/3 as an empirical value because the average cardinality estimation
    // error is relatively small compared with 1, 1/2 and 1/3.

Copy link
Member

@zz-jason zz-jason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zz-jason zz-jason merged commit fa2d6f0 into pingcap:master Apr 24, 2019
Copy link
Contributor

@eurekaka eurekaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please illustrate more on the defaultValue logic? I don't understand the rationales behind it.


// calculateEstimateNDV calculates the estimate ndv of a sampled data from a multisize with size total.
// count[i] stores the count of the i-th element.
// onlyOnceItems is the number of elements that occurred only once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are obsolete?

}

func buildCMSWithTopN(helper *topNHelper, d, w int32, scaleRatio uint64) (c *CMSketch) {
c, helper.sumTopN, helper.numTop = NewCMSketch(d, w), 0, 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set helper.sumTopN and helper.numTop to 0 at this line, enableTopN would always be false? After we record these 2 values in newTopNHelper, we never use them and then set them to 0? This looks unreasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they should be put after enableTopN


// queryAddTopN TopN adds count to CMSketch.topN if exists, and returns the count of such elements after insert.
// If such elements does not in topn elements, nothing will happen and false will be returned.
func (c *CMSketch) updateTopNWithDelta(h1, h2 uint64, d []byte, delta uint64) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are obsolete?

c.defaultValue = 1
} else if estimateNDV <= uint64(helper.numTop) {
c.defaultValue = 1
} else if estimateNDV+helper.onlyOnceItems <= uint64(sampleNDV) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind this condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditon is to avoid estimateNDV - sampleNDV + helper.onlyOnceItems < 0
However, this will never happen and should be deleted.

c.defaultValue = 1
} else {
estimateRemainingCount := rowCount - (helper.sampleSize-uint64(helper.onlyOnceItems))*scaleRatio
c.defaultValue = estimateRemainingCount / (estimateNDV - uint64(sampleNDV) + helper.onlyOnceItems)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind this formula?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formula is to calculate the estimate count of not sampled data.
However, elements occurred only once in sampled data might be sampled just by chance.

}
}

func (c *CMSketch) considerDefVal(cnt uint64) bool {
return cnt < 2*(c.count/uint64(c.width)) && c.defaultValue > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind this formula?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like line 154, scaled count might be too big for some element, we should consider return default value.

@sre-bot sre-bot added the contribution This PR is from a community contributor. label Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/statistics contribution This PR is from a community contributor. status/LGT1 Indicates that a PR has LGTM 1. type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants