Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

planner: fix wrong count when is out of range with datetime #58661

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

hawkingrei
Copy link
Member

@hawkingrei hawkingrei commented Jan 2, 2025

What problem does this PR solve?

Issue Number: close #50080

Problem Summary:

What changed and how does it work?

First, we found that this issue shows a significant difference between datetime with an index and without an index. Therefore, one of the goals of this PR is to reduce the error between them.

> explain select * from test where updated_date > '2023-12-31 23:50:00' and updated_date<'2024-01-01 00:00:00';

+---------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                        | estRows    | task      | access object                                     | operator info                                                                                                                                                                                  |
+---------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| IndexLookUp_10            | 9768858.44 | root      |                                                   |                                                                                                                                                                                                |
| ├─IndexRangeScan_8(Build) | 9768858.44 | cop[tikv] | table:test, index:IX_CUST_RELA_DATE(updated_date) | range:(2023-12-31 23:50:00,2024-01-01 00:00:00), keep order:false, stats:partial[ix_tpcnr_bcn:unInitialized, ix_tpcnr_epn:unInitialized, ix_tpcnr_pan:unInitialized...(more: 1 unInitialized)] |
| └─TableRowIDScan_9(Probe) | 9768858.44 | cop[tikv] | table:test                                        | keep order:false, stats:partial[ix_tpcnr_bcn:unInitialized, ix_tpcnr_epn:unInitialized, ix_tpcnr_pan:unInitialized...(more: 1 unInitialized)]                                                  |
+---------------------------+------------+-----------+---------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

> drop index IX_CUST_RELA_DATE on test;

> explain select * from test where updated_date > '2023-12-31 23:50:00' and updated_date<'2024-01-01 00:00:00';

+---------------------+--------------+-----------+---------------+----------------------------------------------------------------------------------------------------------------+
| id                  | estRows      | task      | access object | operator info                                                                                                  |
+---------------------+--------------+-----------+---------------+----------------------------------------------------------------------------------------------------------------+
| TableReader_7       | 2333.21      | root      |               | data:Selection_6                                                                                               |
| └─Selection_6       | 2333.21      | cop[tikv] |               | gt(test.test.updated_date, 2023-12-31 23:50:00.000000), lt(test.test.updated_date, 2024-01-01 00:00:00.000000) |
|   └─TableFullScan_5 | 859718933.00 | cop[tikv] | table:test    | keep order:false                                                                                               |
+---------------------+--------------+-----------+---------------+----------------------------------------------------------------------------------------------------------------+

The issue actually occurs at this point: we found that datetime is first specially encoded into a uint64, and this uint64 is different from a timestamp — it cannot be directly subtracted.

image

by codec.EncodeKey -> encode -> EncodeMySQLTime -> ToPackedUint,
datetime will become a uint64.

// ToPackedUint encodes Time to a packed uint64 value.
//
//	 1 bit  0
//	17 bits year*13+month   (year 0-9999, month 0-12)
//	 5 bits day             (0-31)
//	 5 bits hour            (0-23)
//	 6 bits minute          (0-59)
//	 6 bits second          (0-59)
//	24 bits microseconds    (0-999999)
//
//	Total: 64 bits = 8 bytes
//
//	0YYYYYYY.YYYYYYYY.YYdddddh.hhhhmmmm.mmssssss.ffffffff.ffffffff.ffffffff

If we only use this uint64 to compare values, it can still be manageable. However, directly using it to compare the difference in magnitude will cause issues.

but PR is only for single-column index.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 2, 2025
@hawkingrei hawkingrei marked this pull request as draft January 2, 2025 09:01
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 2, 2025
@hawkingrei hawkingrei marked this pull request as ready for review January 2, 2025 10:46
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 2, 2025
@hawkingrei hawkingrei changed the title planner: fix wrong count when is out of range with datetime planner: fix wrong count when is out of range with datetime [WIP] Jan 2, 2025
Copy link

ti-chi-bot bot commented Jan 2, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hawkingrei, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

codecov bot commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 71.42857% with 20 lines in your changes missing coverage. Please review.

Project coverage is 73.7014%. Comparing base (23ed0df) to head (74a9df1).
Report is 1 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #58661        +/-   ##
================================================
+ Coverage   73.0885%   73.7014%   +0.6128%     
================================================
  Files          1676       1706        +30     
  Lines        463646     471767      +8121     
================================================
+ Hits         338872     347699      +8827     
+ Misses       103927     102525      -1402     
- Partials      20847      21543       +696     
Flag Coverage Δ
integration 45.5865% <48.5714%> (?)
unit 72.2762% <71.4285%> (-0.0068%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.6910% <ø> (ø)
parser ∅ <ø> (∅)
br 45.7509% <ø> (+0.0179%) ⬆️

@hawkingrei
Copy link
Member Author

/retest

@hawkingrei
Copy link
Member Author

/retest

@hawkingrei hawkingrei changed the title planner: fix wrong count when is out of range with datetime [WIP] planner: fix wrong count when is out of range with datetime Jan 3, 2025
@hawkingrei hawkingrei requested a review from AilinKid January 3, 2025 08:07
@hawkingrei hawkingrei force-pushed the 50080 branch 2 times, most recently from c1f5638 to cda41cd Compare January 6, 2025 09:56
Signed-off-by: Weizhen Wang <wangweizhen@pingcap.com>
Signed-off-by: Weizhen Wang <wangweizhen@pingcap.com>
Signed-off-by: Weizhen Wang <wangweizhen@pingcap.com>
Signed-off-by: Weizhen Wang <wangweizhen@pingcap.com>
Copy link

ti-chi-bot bot commented Jan 7, 2025

@hawkingrei: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-lightning-integration-test 74a9df1 link true /test pull-lightning-integration-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines -71 to -81
valueTime := value.GetMysqlTime()
var minTime types.Time
switch valueTime.Type() {
case mysql.TypeDate:
minTime = types.NewTime(types.MinDatetime, mysql.TypeDate, types.DefaultFsp)
case mysql.TypeDatetime:
minTime = types.NewTime(types.MinDatetime, mysql.TypeDatetime, types.DefaultFsp)
case mysql.TypeTimestamp:
minTime = types.MinTimestamp
}
return float64(valueTime.Sub(UTCWithAllowInvalidDateCtx, &minTime).Duration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to change this.
We can directly use convertDatumToScalar() to replace convertMysqlTimeDatumToScalar() in this PR since the input is definitely of KindMysqlTime.

@@ -239,6 +239,10 @@ func getIndexRowCountForStatsV2(sctx planctx.PlanContext, idx *statistics.Index,
if debugTrace {
debugTraceStartEstimateRange(sctx, indexRange, lb, rb, totalCount)
}
isMysqlTime := indexRange.LowVal[0].Kind() == types.KindMysqlTime &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this variable put so far from where it's used?
It can be moved to where it's used, and part of it can be merged with isSingleColRange.

@@ -896,6 +897,22 @@ func (hg *Histogram) OutOfRange(val types.Datum) bool {
chunk.Compare(hg.Bounds.GetRow(hg.Bounds.NumRows()-1), 0, &val) < 0
}

func convertTime(d *types.Datum) (*types.Datum, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some comments to explain why we need the extra conversion.

return nil
}

func TestIssue50080(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't (1) put the user schema in the public test case and (2) put such a large stats file with the code if unnecessary.
Possible suggestion:

  1. change the column names and put this case into tests/integrationtest/ where the stats files are compressed.
  2. put this case into the private repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

row count estimation on DATETIME type is over-estimated when time span across years/months
3 participants