Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Clustered Index #4841

Closed
2 tasks
hanfei1991 opened this issue Oct 20, 2017 · 4 comments
Closed
2 tasks

Support Clustered Index #4841

hanfei1991 opened this issue Oct 20, 2017 · 4 comments
Assignees
Labels
feature/accepted This feature request is accepted by product managers priority/P0 The issue has P0 priority. type/feature-request Categorizes issue or PR as related to a new feature.
Milestone

Comments

@hanfei1991
Copy link
Member

hanfei1991 commented Oct 20, 2017

Description

Currently, TiDB arranges table's row data by handle id, for example we have a table whose schema is

CRATE TABLE t (
  uid varchar(64),
  s int(10),
  data varchar(255),
  primary key (uid, s),
)

There will be two key-value pairs for each row data:

  • handle id -> (uid, s, data)
  • (uid, s) -> handle id

The handle id is a uint64 generated by TiDB. There are some disadvantages for this type of data arrangement:

  • we need to write two key-value pairs for each row data
  • when reading the data by primary key, TiDB will trigger two read operations, one is to fetch the related handle id by primary key, and the other is to fetch the data by the handle id. This will increase the latency of SELECT statements.

In the clustered index, we arrange row data by primary key, in the above example, there will be only one key-value pair for each row data, this will the disadvantages mentioned above:

  • (uid, s) -> (data)

Task list

  • P0: TiDB support clustered index
  • P0: TiFlash support clusered index
    TiFlash is column-oriented storage, it only stores specified tables' row data, used to speedup AP queries on these tables. After TiDB supports the clustered-index feature, TiFlash also should recognize it.

Progress Tracking: https://github.com/pingcap/tidb/projects/45

Category

Feature

Value

The clustered index can speed up some queries. Its usage can refer to:

Workload Estimation

600

Time

GanttStart: 2020-06-01
GanttDue: 2020-10-30
GanttProgress: 100%

Weekly Report

https://docs.google.com/document/d/1zeCni1g-BfGsi7B_qWSbSJGXFiWrDaUsqic14gXygk8/edit?usp=sharing

@hanfei1991 hanfei1991 added the priority/P2 The issue has P2 priority. label Oct 20, 2017
@hanfei1991 hanfei1991 added this to the 1.1 milestone Oct 20, 2017
@hanfei1991 hanfei1991 assigned XuHuaiyu and unassigned breezewish Nov 16, 2017
@ngaut ngaut removed this from the 2.0 milestone May 29, 2018
@morgo
Copy link
Contributor

morgo commented Mar 22, 2019

I am trying to understand this feature correctly. Is it a change so that the handle_id may be some other key or column, other than the primary key?

@gertvdijk
Copy link

Clustering your data determines how the data is laid out on disk. Several databases use different terminology to describe the same thing. Let me explain it by example.

Suppose you have a schema with columns object_id (integer, incrementing & unique), user_id (integer), data (binary/varchar or whatever) and you're running MySQL with InnoDB, the following options have a huge impact on your query performance:

  • PRIMARY KEY: object_id:
    Data on disk is ordered by this column. Results:

    • will be fast to SELECT points or ranges of object_id
    • INSERTs are very fast too, as they are effectively appended to the data files
    • SELECT queries for specific user_id values are super slow and require heavy random I/O even with secondary indexes
  • PRIMARY KEY: user_id, object_id:
    Data on disk is clustered by user_id, then by object_id. Results:

    • will be slow to SELECT points or ranges of object_id
    • INSERTs are very slow too, as they most likely spread out by user_id resulting in random writes
    • SELECTs for specific user_id values are super fast and require only a few seeks to get thousands of rows.

You can have secondary indexes to help you know where to look for data on disk, but if you want 1000 rows for a single user_id value, and your secondary index points you to 990 different pages on disk, you still see poor performance (unless you can afford yourself to have all your data on Optane SSDs). If you would choose to cluster your data by user_id, you can understand all user data is mostly clustered together and SELECT queries on specific user_ids are much faster, but then the INSERTs are very expensive because they happen with a huge spread on user_ids - ie. heavy write amplification (for almost every row a whole page needs to be rewritten, scattered across the volume). For some workloads like non-critical INSERT-time (queuing externally) and a huge user_id spread this may be the difference in several orders of magnitude for query time on SELECTs for specific users_ids.

In MySQL with InnoDB, the PRIMARY KEY means it is the clustering key, and there's only one clustering key. In MySQL/derivatives with TokuDB as storage engine you can have multiple clustering keys. That would make TokuDB effectively duplicate the table for you in different orders for each additional clustering key. Both TokuDB and InnoDB do a little bit of trade-off to not hurt INSERT performance too much. Consolidation of data before it's actually written to the data files, not fully filled pages and a bit of fragmentation is the result, but in practice it's still a huge benefit in performance for the SELECTs for specific user_ids.

PostgreSQL has even another approach: it does not order data on disk in a specific order. It just appends new INSERTS at the end of the data file, until you tell it to cluster it (e.g. CLUSTER command or pg_repack to do that online). This means that clustering basically runs out of band and implies a full table rebuild. (And yes, this is also true when you mark an index and clustering index.)

Cassandra works similarly to MySQL with regard to the user/schema with a primary key, but under the hood more like PostgreSQL with later compaction in the background to cluster data efficiently (but that's in fact quite complex, and this sentence is not very accurate once you know the details).


To get back to TiDB case; it would be very much beneficial to be able to control how data is laid out on disk and to be able to make a trade-off in INSERT performance to gain certain SELECT performance with it.

@jackysp jackysp added this to the v5.0-alpha milestone Jun 24, 2020
@jackysp jackysp added priority/P0 The issue has P0 priority. and removed priority/P0 The issue has P0 priority. labels Jun 24, 2020
@jackysp
Copy link
Member

jackysp commented Jun 24, 2020

https://github.com/pingcap/tidb/projects/45

Workload Estimation

600

@jackysp jackysp added priority/P0 The issue has P0 priority. type/feature-request Categorizes issue or PR as related to a new feature. and removed priority/P2 The issue has P2 priority. labels Jun 24, 2020
@zz-jason zz-jason changed the title Support clustered index Support Clustered Index Jul 14, 2020
@zanmato1984
Copy link
Contributor

Need to add a sub-task for TiFlash Clustered Index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/accepted This feature request is accepted by product managers priority/P0 The issue has P0 priority. type/feature-request Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests