-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] "create operator class" fails in YB. The doc gives no warning of this. #9626
Comments
Played with the test case 1 a bit, I had one new finding described below: When creating the operator class 'user_defined_ops" in 4-cr-operator-class.sql, it has the clause "using btree". The operator class is created successfully in both PG and YB. So the earlier comment in testcase-1 should be
In YB:
Note that opcmethod is 403, it represents BTREE_AM_OID according to src/postgres/src/include/catalog/pg_am.dat
However YB does not support btree index. In fact YB will turn btree into lsm with a NOTICE:
Because YB uses "lsm" rather than "btree", that's why 'user_defined_ops' that can be used for "lsm" does not exist. The created 'user_defined_ops' is only meant for use with "btree".
So it appears that YB should also turn btree into lsm in the create operator class statement, which would have resulted in
|
I have some more updates. First, let's see how postgres supports
When creating the index, from base table t1 we know the type of column v is text. In addition, user has specified a operator class 'user_defined_ops' so the default operator class associated with 'text' type (text_ops) will not be used for this index. Once the index t1_v is created, the information that it will use 'user_defined_ops' rather than 'text_ops' as the operator class is stored in the catalog table pg_index.
Because there is a unique index t1_v built on table t1, inserting into index requires comparison to enforce uniqueness constraint of t1_v. PG can find out the comparison support PL/pgSQL function by performing the following steps. (2.1) From pg_class PG can find out the oid (object identifier, an 32-bit unsigned integer) of index t1_v:
(2.2) From pg_index PG can get the operator class associated with t1_v:
The pg_index.indclass column is the operator class associated with t1_v. (2.3) PG can lookup pg_opclass for operator class 24749:
(2.4). From pg_amproc PG can get the support procs for the opclass:
We can see there is only 1 support proc associated with opclass user_defined_ops: int_eq. (2.5). PG function handler will therefore dispatch to PL/pgSQL function int_eq to carry out the comparison. YB uses docdb as its storage engine. Docdb is an enhanced rocksdb. Rocksdb is logically a key-value store. Both key and value are just bytes that encode user data. The K-V store is logically sorted by keys. When inserting a new K-V pair, the new K will be compared with existing keys. This comparison is done in memory (the on-disk K-V pairs must be loaded into memory first). By default, rocksdb uses a byte-wise comparator which basically does memcmp(). This is called the default comparator. If byte-wise comparison does not meet user expection, rocksdb also has a mechanism to allow user to supply a function when the database is newly opened. This is called a custom comparator. When a custom comparator is provided, it replaces the default comparator and keys will be compared by the customer comparator which can meet user expection on how the keys should be sorted. Currently, docdb only supports the default comparator. As a result, keys are compared bytewise via memcmp(). When this does not meet user-expection, then the user data must be cleverly encoded. Let's say there are two postgres keys k1 and k2, after encoding they become e(k1) and e(k2). The encoding must be done such that:
In other words, the encoded keys can be byte-wise compared while preserving the original order. Apparently, for user-defined operator class user_defined_ops that compares two text values via user defined function int_eq, YB cannot look into int_eq to see what it does and then automatically come up with a "clever" encoding that can be bytewise compared while keeping the original order. This is just not possible. If docdb supports custom comparator, then I can imagine to support int_eq like how PG's works above: PG performs the same steps to resolve int_eq and the result is stored in docdb sys catalog table as metadata. Once the custom comparator is initialized, it looks up sys catalog table to find out the PL/pgSQL function name 'int_eq'. After that, each time the custom comparator executes, it sends a request to PG which can dispatch to PL/pgSQL function int_eq to carry out the comparison. The comparison result is sent back to the docdb to determine how the keys (i.e., postgres data) are sorted. In this case there is no need for any "clever" encoding because int_eq works directly on the un-encoded postgres raw data. Compared to native postgres, this obviously is very costly: it needs at least inter-process communication because docdb is in the yb-tserver process and PG is in the postgres process. It may even involve network traffic if the docdb sits on a different node from the one that runs the postgres process. A more realistic work around is for user to implement int_eq in C instead of PL/pgSQL. This makes it works like a builtin function. The user must rebuild YB executables from source code. Note that the user will do the same rebuild for native postgres if a user-defined int_eq does not meet the performance requirement and therefore needs to be reimplemented in C. The builtin functions can be grouped into a shared library that is linked into the docdb process. Such C implementation of int_eq will be executed directly on raw data, no "clever" encoding is needed. In addition, it is executed within the docdb process only and therefore will perform much faster, which is important because ultimately that is why an index is built for. |
I was looking at "YB should also turn btree into lsm in the create operator class statement". It turned out that it is not accurate to say YB supports lsm instead of btree. YB does support btree index for temporary table. For example
Note that t_id_idx is a btree. Next I made the following changes in testcase-1 and testcase-2:
to
and
to
Then both testcase-1 and testcase-2 worked correctly in YB without any error. Therefore, the earlier comment "YB should also turn btree into lsm in the create operator class statement" cannot be done because that will break both testcase-1 and testcase-2 when t1 and t2 are created as temporary tables. In YB, temporary tables are not raft-replicated and only exist on one node. Once the session is over (DB connection is disconnected), they are dropped automatically. Temporary tables and their indexes have no traces in docdb at all and they are entirely managed by PG. That's why for temporary tables YB does not support lsm index and instead supports btree index. It is for regular tables YB needs to use docdb as backend storage. Because of the nature of docdb as discussed above, regular table btree index is automatically turned into lsm index. |
NOTE: I ran the tests described below using YB-2.9.0.0-b0. In the light of @myang2021's analysis, I made a new testcase using only testcase-2 from the .zip that I already submitted. I changed the two "create table" statements in 2-cr-and-populate-tables-t1-and-t2.sql to use "create temporary table". Now, the testcase ran to completion in YB. But when I compared the equality-commutator-defined.txt spool files produced by YB and by PG, I saw that the row-ordering differed between the two. This is unsurprising because the queries had no order by clause. So I added these by editing 6-do-queries.sql. The changes to these two files are the only differences between the old and the new versions of testcase-2. Briefly, when I used order by t1.v using <<, the test ran without error in both PG and YB. But still, the ordering had differences. This, too, is unsurprising because the data was contrived to have several pairs of rows with the same area but different values of rect by exchanging the h and w values. So I added a second criterion to order by k. This caused errors in YB (but not, of course, in PG). So I then tried ordering just by k — but this, too, caused errors in YB. Here is an example:
The same error comes several times. Look for this in 6-do-queries.sql:
The identical order_by alternatives are found in prepare operator_based_join too. Look, too, at prepare operator_based_order. This has corresponding alternatives. But they are spelled differently because this test uses just one table. As presented, the test runs without error in YB and in PG. But when Alt. 1 is commented out, and Alt. 2 is uncommented, this produces the error. It's the same when Alt. 3 is used. You should run the tests by hand, with one terminal connected to PG and the other connected to YB side-by-side. Then you'll understand where, and how, the errors occur more effectively than using words here allows. |
Good finding. After debugging with the new test cases Alt. 2 and Alt. 3, I found that this is a bug that is just recently fixed by https://phabricator.dev.yugabyte.com/D12666#557e08ef, the relevant change is:
Note that earlier we just say as long as |
Good news. Please update this issue's acount with the Version (e.g. YB-2.9.0.1-b0 or similar) when the version that has the fix is available for download from our Internet-facing Install YugabyteDB—macOS page and its cousins. |
@bllewell FWIW, having run into the inability to index on user-defined types today (in my case I first ran into it trying to make a primary key with a user-defined type that I had just coded up in C), I'll say that I'd be totally fine providing an implementation of a new operator class / function that returns a lexicographically-comparable buffer to use in the RocksDB key for DocDB. |
@saurik , do you mean you already have such an implementation that you want to be included into the yugabyte main code base? |
@myang2021 By "providing an implementation" I mean as the app developer. Like, what I'm seeing you so far talking about in this issue for accomplishing this goal is build out support for the btree operator class down through DocDB into RocksDB to support custom comparators on keys... but I totally appreciate why that is complex and I can see a ton of problems that would cause and I'm not really expecting that to happen any time soon? The nice thing about that is that it will just work "out of the box" for a lot of existing types"; but, as someone who has been using PostgreSQL for about 20 years now and is starting to work with YugabyteDB, I'm kind of used to having to implement random operator classes to support new index types, and am thereby throwing out the idea that a much simpler approach--one that would work for now and should be easy to continue to be compatible later if you make it have higher priority or flag which was used to build an index--would be sufficient for me last night: you (YugabyteDB) make a new operator class with a function that we (the app developers) can implement that takes our user-defined type and returns a lexicographically-orderable bytea. (Given the error message I got I had actually expected that that would already be how it worked under the covers, but glancing through the YugabyteDB repository I failed to find such.) |
Jira Link: DB-1515
issue-9626.zip
Summary
The attached testcase unzips to produce two directories: testcase-1 and testcase-2. Two tests are needed to avoid conflating separable issues:
— You can’t execute “create operator class” in YB (even though the YSQL doc describes this statement with no caveats).
— You can’t create an lsm index “bare” (i.e. even without mentioning an operator class) on a UDT column in YB while you can create such a btree index in PG. (The testcase doesn't demonstrate this. But we all know it to be the case.) Research done by Minghui Yang has shown on PG that, as long as data types of the UTT's N individual fields support ordinary indexing (i.e. btree in PG), then ordering by UDT values and creating an index on a UDT column works in just the same way as it does for order by with a list of N columns or a composite index across N columns. And YB does allow creating a composite index and ordering by a column list. So it seems that there's no barrier to allowing an lsm index on a bare UDT in YB just as PG does this for btree (with the same caveat about the indexability of the individual fields).
Notice that probably the most valuable application for user-defined operators is exactly in connection with user-defined semantics for comparing UDT values. And this brings indexing (where the operator class is named) in its train. So we need both the ability to create a "bare" index on a UDT column and the ability to create an index on such a column with reference to an operator class. The second testcase is a soup-to-nuts demo of the latter in PG.
You should run the scripts in PG (they were tested in 13.3 but will run fine it 11.2.) When you're satisfied that they work as expected, then try to get them to work in YB. But expect to fail to be able to do this.
Background reading (in the PG doc—and a blog)
create operator
create operator class
37.14. User-Defined Operators
37.15. Operator Optimization Information
38.15. Interfacing Extensions To Indexes
Operator Classes Explained (blog)
testcase-1
This test uses a scalar column of type text. It creates user-defined operators for equality and inequality that compare the left and right arguments case insensitively. As presented, it's written for PG—i.e. it mentions btree. Run it first like this. Then try to get it to work for YB. Look at 0.sql. It calls these scripts:
And 6-do-queries.sql in turn calls do-single-table-test.sql. The code is self-explanatory. The 0.sql script can be run time and again from cold or just having run it before. You must edit it to connect to a superuser that has a default schema. Look for "\c demo super". This illustrates the basic idea:
Nothing says that create operator class is not yet implemented (with the usual invitation to file a GitHub issue to ask for it). At the very least, this should be fixed.
Predicates are expressed using the newly-defined operators like this:
And, of course, index support for SQL statements with such predicates relies on mentioning the operator class name in the create index statement as shown.
This script creates table t1 with this population:
Then this query:
gets this result:
Special syntax is needed for order by like this:
This is the result:
Look for this (in 3-cr-supporting-code-and-operators.sql):
And look for this in 0.sql:
When you comment out the "commutator" line, change the spool file name to no-equality-commutator-defined.txt. and re-run 0.sql. Then diff the two spool files. When the commutator line is active and when only index t2_v exists, the plan has this:
But when the commutator line is commented out and when only index t2_v exists, the plan has this:
This is in accord with what 37.15. Operator Optimization Information says:
testcase-2
This is a more-or-less mechanical re-write of testcase-1 using a column based on the UTD rect (for rectangle):
The idea here is that the user-defined operators will compare the areas of the right and left arguments. So this helper is defined:
The rest of the test follows the same general pattern as testcase-1. In particular, the spelling of 0.sql is identical in both testcase-1 and testcase-2. So the remaining .sql files are also identically named. You should pairwise diff for each filename to appreciate just how similar they are. The differences are mainly just that the column data type text is replaced with rect. And the user-defined function area(rect, rect) in _testcase-2_takes the place of the built-in lower(text) in testcase-1. (Obviously, the code to populate the tables differs accordingly.) You'll also see, and readily understand, other differences.
Discussion
There are two different ways to define a UDT in PG:
—either (basic) as a pure behavior-free tuple (“create type”):
—or (advanced) by going under the covers and providing C implementations for the type’s methods (38.12. User-defined Types):
(The last step in the under-the-covers approach requires a particular, minimal, use of “create type”.)
If you want to create a semantically meaningful index on a basic behavior-free UDT, then you define the ordering rules by defining user-defined inequality and equality operators. Then you use “create operator class” to map the operators to the UDT’s name. Then, at index-creation time (and order by time) you mention the name of the operator class. The testcase-2 illustrates this.
The text was updated successfully, but these errors were encountered: