Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support STORING columns INDEX #18616

Open
zhangjinpeng87 opened this issue Jul 16, 2020 · 6 comments
Open

Support STORING columns INDEX #18616

zhangjinpeng87 opened this issue Jul 16, 2020 · 6 comments
Labels
feature/discussing This feature request is discussing among product managers type/feature-request Categorizes issue or PR as related to a new feature. type/performance

Comments

@zhangjinpeng87
Copy link
Contributor

Feature Request

Is your feature request related to a problem? Please describe:

Assume we have a table named t, and there is a secondary index on column col3 and col4:

CREATE TABLE t (
  id INT primary key,
  col2 INT,
  col3 INT,
  col4 char(32), 
  key idx_col2_col3(col2, col3),
);

SELECT col4 from t WHERE col2=1 AND col3>5, this query will search the related id by the index idx_col2_col3 at first, and then read the value of col4, we call this process as double read.

Describe the feature you'd like:

If we store the column col4 in the index idx_col2_col3, then we can read the value of col4 directly. The CREATE statement looks like that:

CREATE TABLE t (
  id INT primary key,
  col2 INT,
  col3 INT,
  col4 char(32), 
  key idx_col2_col3(col2, col3) STORING (col4),
);

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

@zhangjinpeng87 zhangjinpeng87 added type/feature-request Categorizes issue or PR as related to a new feature. type/performance labels Jul 16, 2020
@zz-jason
Copy link
Member

zz-jason commented Jul 24, 2020

@zhangjinpeng1987 Are there any other references about the storing column index?

@zz-jason zz-jason added feature/reviewing This feature request is reviewing by product managers feature/discussing This feature request is discussing among product managers and removed feature/reviewing This feature request is reviewing by product managers labels Aug 6, 2020
@scsldb
Copy link

scsldb commented Aug 27, 2020

col4 add to idx_col2_col3 index, maybe the problem is solved.

@zhangjinpeng87
Copy link
Contributor Author

@scsldb They are different if there are several columns that need to retrieve or the value of these columns is large. Add columns to index means these columns will be encoded to the index key(large keys are not friendly to the storage engine), but if storing columns, these columns will be stored in the value.

@zz-jason https://www.cockroachlabs.com/docs/stable/indexes.html#storing-columns

@zhangjinpeng87
Copy link
Contributor Author

@scsldb @zz-jason Cloud-Spanner also has such a feature

Following description copy from https://cloud.google.com/spanner/docs/whitepapers/optimizing-schema-design

_STORING index clause
Secondary indexes allow you to find rows by attributes other than the primary key. If all the data requested is in the index itself, it can be consulted on its own without reading the primary record. This can save significant resources as no join is required.

Unfortunately, index keys are limited to 16 in number and 8 KiB in aggregate size, restricting what can be put in them. To compensate for these limitations, Cloud Spanner has the ability to store extra data in any index, via the STORING clause. STORING a column in an index results in its values being duplicated, with a copy stored in the index. You can think of an index with STORING as a simple single table materialized view (views are not natively supported in Cloud Spanner at this time).

Another useful application of STORING is as part of a NULL_FILTERED index. This allows you to define what is effectively a materialized view of a sparse subset of a table that you can scan efficiently. For example, you might create such an index on the is_unread column of a mailbox to be able to serve the unread messages view in a single table scan, but without paying for a complete copy of every mailbox._

@zz-jason
Copy link
Member

zz-jason commented Sep 8, 2020

This feature is awesome:

  • It reduces the read performance of queries on a secondary index by eliminating a probe operation on the table records.
  • It allows the user to choose which columns to be stored, i.e. clustered, together with the secondary index.

But in order not to waste the storage and increase the write latency caused by storing unnecessary columns, which columns to be stored in the secondary index should be well-designed.

@tirsen
Copy link
Contributor

tirsen commented Feb 2, 2021

This indeed looks very interesting!

It will also help in the following scenario:

CREATE TABLE payment (
  id           BIGINT PRIMARY KEY,
  sender_id    BIGINT,
  recipient_id BIGINT,
  amount_cents BIGINT
);

SELECT SUM(amount_cents) AS amount_sent
FROM payment
WHERE sender_id = ?;

SELECT SUM(amount_cents) AS amount_received
FROM payment
WHERE recipient_id = ?;

Both of these queries will do a double read and if the table is very large a large fan out scatter-gather query to fetch all the amount_cents data. Since the scatter-gather query has such a wide fan out we will not get benefit of pushing the aggregation operation down in tikv.

If we added these indexes:

ALTER TABLE payment ADD KEY idx_sender_id (sender_id) STORING (amount_cents);
ALTER TABLE payment ADD KEY idx_recipient_id (recipient_id) STORING (amount_cents);

This is very difficult to achieve in a traditionally sharded system where you can only pick one column to shard by (recipient or sender) and the other side will be punished. Regular clustered primary key indexes such as those that will be released in 5.0 will also not solve for both of these queries, only one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/discussing This feature request is discussing among product managers type/feature-request Categorizes issue or PR as related to a new feature. type/performance
Projects
None yet
Development

No branches or pull requests

4 participants