Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] "create operator class" fails in YB. The doc gives no warning of this. #9626

Open
bllewell opened this issue Aug 6, 2021 · 9 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature pgcm priority/medium Medium priority issue

Comments

@bllewell
Copy link
Contributor

bllewell commented Aug 6, 2021

Jira Link: DB-1515
issue-9626.zip

Summary

The attached testcase unzips to produce two directories: testcase-1 and testcase-2. Two tests are needed to avoid conflating separable issues:

— You can’t execute “create operator class” in YB (even though the YSQL doc describes this statement with no caveats).

— You can’t create an lsm index “bare” (i.e. even without mentioning an operator class) on a UDT column in YB while you can create such a btree index in PG. (The testcase doesn't demonstrate this. But we all know it to be the case.) Research done by Minghui Yang has shown on PG that, as long as data types of the UTT's N individual fields support ordinary indexing (i.e. btree in PG), then ordering by UDT values and creating an index on a UDT column works in just the same way as it does for order by with a list of N columns or a composite index across N columns. And YB does allow creating a composite index and ordering by a column list. So it seems that there's no barrier to allowing an lsm index on a bare UDT in YB just as PG does this for btree (with the same caveat about the indexability of the individual fields).

Notice that probably the most valuable application for user-defined operators is exactly in connection with user-defined semantics for comparing UDT values. And this brings indexing (where the operator class is named) in its train. So we need both the ability to create a "bare" index on a UDT column and the ability to create an index on such a column with reference to an operator class. The second testcase is a soup-to-nuts demo of the latter in PG.

You should run the scripts in PG (they were tested in 13.3 but will run fine it 11.2.) When you're satisfied that they work as expected, then try to get them to work in YB. But expect to fail to be able to do this.

Background reading (in the PG doc—and a blog)

testcase-1

This test uses a scalar column of type text. It creates user-defined operators for equality and inequality that compare the left and right arguments case insensitively. As presented, it's written for PG—i.e. it mentions btree. Run it first like this. Then try to get it to work for YB. Look at 0.sql. It calls these scripts:

1-drop-all-testcase-objects.sql
2-cr-and-populate-tables-t1-and-t2.sql
3-cr-supporting-code-and-operators.sql

# This fails in YB. So there's no point in continuing.
4-cr-operator-class.sql 
5-cr-indexes.sql
6-do-queries.sql

And 6-do-queries.sql in turn calls do-single-table-test.sql. The code is self-explanatory. The 0.sql script can be run time and again from cold or just having run it before. You must edit it to connect to a superuser that has a default schema. Look for "\c demo super". This illustrates the basic idea:

...

create function eq(v1 in text, v2 in text) 
  returns boolean
  language plpgsql
as $body$
begin
  return lower(v1) = lower(v2);
end;
$body$;

...

create operator == (
  commutator = == ,
  negator    = <=> ,
  leftarg    = text,
  rightarg   = text,
  procedure  = eq);

create operator class user_defined_ops
  for type text
  using btree
as
  operator 1 << ,
  operator 2 <<= ,
  operator 3 == ,
  operator 4 >>= ,
  operator 5 >> ,
  function 1 int_eq(text, text);

create unique index t_v on t using btree (v user_defined_ops);

Nothing says that create operator class is not yet implemented (with the usual invitation to file a GitHub issue to ask for it). At the very least, this should be fixed.

Predicates are expressed using the newly-defined operators like this:

select t1.v as "t1.v", t2.v as "t2.v"
from t1, t2
where t1.v == t2.v;

And, of course, index support for SQL statements with such predicates relies on mentioning the operator class name in the create index statement as shown.

This script creates table t1 with this population:

   v    
--------
 BIRD
 CAT
 FISH
 SPIDER
 beetle
 doG
 dog
 frog
 lizard
 snake

Then this query:

select v from t1 where v == 'dog';

gets this result:

  v  
-----
 dog
 doG

Special syntax is needed for order by like this:

select v from t1 order by v using <<

This is the result:

   v    
--------
 beetle
 BIRD
 CAT
 dog
 doG
 FISH
 frog
 lizard
 snake
 SPIDER

Look for this (in 3-cr-supporting-code-and-operators.sql):

create operator == (
  commutator = == , -- comment this out to see that then
                    -- the index t2_v is NOT used
  negator    = <=> ,
  ...

And look for this in 0.sql:

\o equality-commutator-defined.txt

When you comment out the "commutator" line, change the spool file name to no-equality-commutator-defined.txt. and re-run 0.sql. Then diff the two spool files. When the commutator line is active and when only index t2_v exists, the plan has this:

Index Cond: (v == t1.v)

But when the commutator line is commented out and when only index t2_v exists, the plan has this:

Join Filter: (t1.v == t2.v)

This is in accord with what 37.15. Operator Optimization Information says:

It's critical to provide commutator information for operators that will be used in indexes and join clauses, because this allows the query optimizer to “flip around” such a clause to the forms needed for different plan types. For example, consider a query with a WHERE clause like tab1.x = tab2.y, where tab1.x and tab2.y are of a user-defined type, and suppose that tab2.y is indexed. The optimizer cannot generate an index scan unless it can determine how to flip the clause around to tab2.y = tab1.x, because the index-scan machinery expects to see the indexed column on the left of the operator it is given. PostgreSQL will not simply assume that this is a valid transformation — the creator of the = operator must specify that it is valid, by marking the operator with commutator information.

testcase-2

This is a more-or-less mechanical re-write of testcase-1 using a column based on the UTD rect (for rectangle):

create type rect as (h int, w int);

The idea here is that the user-defined operators will compare the areas of the right and left arguments. So this helper is defined:

create function area(r rect)
  returns int
  immutable
  language plpgsql
as $body$
begin
  return (r.h)*(r.w);
end;
$body$;

The rest of the test follows the same general pattern as testcase-1. In particular, the spelling of 0.sql is identical in both testcase-1 and testcase-2. So the remaining .sql files are also identically named. You should pairwise diff for each filename to appreciate just how similar they are. The differences are mainly just that the column data type text is replaced with rect. And the user-defined function area(rect, rect) in _testcase-2_takes the place of the built-in lower(text) in testcase-1. (Obviously, the code to populate the tables differs accordingly.) You'll also see, and readily understand, other differences.

Discussion

There are two different ways to define a UDT in PG:

either (basic) as a pure behavior-free tuple (“create type”):

https://www.postgresql.org/docs/11/sql-createtype.html

or (advanced) by going under the covers and providing C implementations for the type’s methods (38.12. User-defined Types):

https://www.postgresql.org/docs/11/xtypes.html

(The last step in the under-the-covers approach requires a particular, minimal, use of “create type”.)

If you want to create a semantically meaningful index on a basic behavior-free UDT, then you define the ordering rules by defining user-defined inequality and equality operators. Then you use “create operator class” to map the operators to the UDT’s name. Then, at index-creation time (and order by time) you mention the name of the operator class. The testcase-2 illustrates this.

@bllewell bllewell changed the title "create operator class" fails in YB. The doc gives no warning of this. [YSQL] "create operator class" fails in YB. The doc gives no warning of this. Aug 6, 2021
@bllewell bllewell added area/ycql Yugabyte CQL (YCQL) area/ysql Yugabyte SQL (YSQL) and removed area/ycql Yugabyte CQL (YCQL) labels Aug 6, 2021
@myang2021
Copy link
Contributor

Played with the test case 1 a bit, I had one new finding described below:

When creating the operator class 'user_defined_ops" in 4-cr-operator-class.sql, it has the clause "using btree". The operator class is created successfully in both PG and YB. So the earlier comment in testcase-1 should be

1-drop-all-testcase-objects.sql
2-cr-and-populate-tables-t1-and-t2.sql
3-cr-supporting-code-and-operators.sql
4-cr-operator-class.sql

# This fails in YB. So there's no point in continuing.
5-cr-indexes.sql
6-do-queries.sql

In YB:

create operator class user_defined_ops
  for type text
  using btree
as
  operator 1 << ,
  operator 2 <<= ,
  operator 3 == ,
  operator 4 >>= ,
  operator 5 >> ,
  function 1 int_eq(text, text);
CREATE OPERATOR CLASS
yugabyte=# \x
Expanded display is on.
yugabyte=# select * from pg_opclass where opcname = 'user_defined_ops';
select * from pg_opclass where opcname = 'user_defined_ops';
-[ RECORD 1 ]+-----------------
opcmethod    | 403
opcname      | user_defined_ops
opcnamespace | 2200
opcowner     | 13247
opcfamily    | 16410
opcintype    | 25
opcdefault   | f
opckeytype   | 0

Note that opcmethod is 403, it represents BTREE_AM_OID according to src/postgres/src/include/catalog/pg_am.dat

{ oid => '403', oid_symbol => 'BTREE_AM_OID',
  descr => 'b-tree index access method',
  amname => 'btree', amhandler => 'bthandler', amtype => 'i' },

However YB does not support btree index. In fact YB will turn btree into lsm with a NOTICE:

yugabyte=# create unique index t1_v on t1 using btree(v);
create unique index t1_v on t1 using btree(v);
NOTICE:  00000: index method "btree" was replaced with "lsm" in YugabyteDB
LOCATION:  DefineIndex, indexcmds.c:757
CREATE INDEX

Because YB uses "lsm" rather than "btree", that's why 'user_defined_ops' that can be used for "lsm" does not exist. The created 'user_defined_ops' is only meant for use with "btree".

ysqlsh:5-cr-indexes.sql:32: ERROR:  42704: operator class "user_defined_ops" does not exist for access method "lsm"

So it appears that YB should also turn btree into lsm in the create operator class statement, which would have resulted in
ERROR in 4-cr-operator-class.sql:

create operator class user_defined_ops
  for type text
  using lsm
as
  operator 1 << ,
  operator 2 <<= ,
  operator 3 == ,
  operator 4 >>= ,
  operator 5 >> ,
  function 1 int_eq(text, text);
ysqlsh:4-cr-operator-class.sql:16: ERROR:  0A000: unsupported index method lsm for CREATE OPERATOR FAMILY
LOCATION:  DefineOpClass, opclasscmds.c:371

@myang2021
Copy link
Contributor

I have some more updates.

First, let's see how postgres supports user_defined_ops in an index.

  1. Index creation
create table t1(k int primary key, v text not null);
create unique index t1_v on t1 using btree (v user_defined_ops);

When creating the index, from base table t1 we know the type of column v is text. In addition, user has specified a operator class 'user_defined_ops' so the default operator class associated with 'text' type (text_ops) will not be used for this index.

Once the index t1_v is created, the information that it will use 'user_defined_ops' rather than 'text_ops' as the operator class is stored in the catalog table pg_index.

  1. Index insertion
postgres=# insert into t1(k,v) values (11, 'DOG');

Because there is a unique index t1_v built on table t1, inserting into index requires comparison to enforce uniqueness constraint of t1_v. PG can find out the comparison support PL/pgSQL function by performing the following steps.

(2.1) From pg_class PG can find out the oid (object identifier, an 32-bit unsigned integer) of index t1_v:

postgres=# select oid, relname from pg_class where relname = 't1_v';
  oid  | relname
-------+---------
 24757 | t1_v
(1 row)

(2.2) From pg_index PG can get the operator class associated with t1_v:

postgres=# select indclass from pg_index where indexrelid = 24757;
 indclass
----------
 24749
(1 row)

The pg_index.indclass column is the operator class associated with t1_v.

(2.3) PG can lookup pg_opclass for operator class 24749:

postgres=# select * from pg_opclass where oid = 24749;
-[ RECORD 1 ]+-----------------
opcmethod    | 403
opcname      | user_defined_ops
opcnamespace | 2200
opcowner     | 10
opcfamily    | 24634
opcintype    | 25
opcdefault   | f
opckeytype   | 0

(2.4). From pg_amproc PG can get the support procs for the opclass:

postgres=# select * from pg_amproc where amprocfamily = 24634;
 amprocfamily | amproclefttype | amprocrighttype | amprocnum | amproc
--------------+----------------+-----------------+-----------+--------
        24634 |             25 |              25 |         1 | int_eq
(1 row)

We can see there is only 1 support proc associated with opclass user_defined_ops: int_eq.

(2.5). PG function handler will therefore dispatch to PL/pgSQL function int_eq to carry out the comparison.
(2.6) Because int_eq is a user defined function, it is executed by the PL/pgSQL byte code interpreter.

YB uses docdb as its storage engine. Docdb is an enhanced rocksdb. Rocksdb is logically a key-value store. Both key and value are just bytes that encode user data. The K-V store is logically sorted by keys. When inserting a new K-V pair, the new K will be compared with existing keys. This comparison is done in memory (the on-disk K-V pairs must be loaded into memory first). By default, rocksdb uses a byte-wise comparator which basically does memcmp(). This is called the default comparator. If byte-wise comparison does not meet user expection, rocksdb also has a mechanism to allow user to supply a function when the database is newly opened. This is called a custom comparator. When a custom comparator is provided, it replaces the default comparator and keys will be compared by the customer comparator which can meet user expection on how the keys should be sorted.

Currently, docdb only supports the default comparator. As a result, keys are compared bytewise via memcmp(). When this does not meet user-expection, then the user data must be cleverly encoded. Let's say there are two postgres keys k1 and k2, after encoding they become e(k1) and e(k2). The encoding must be done such that:

  k1 < k2  iff memcmp(e(k1), e(k2)) < 0
  k1 == k2 iff memcmp(e(k1), e(k2)) == 0
  k1 > k2  iff memcmp(e(k1), e(k2)) > 0

In other words, the encoded keys can be byte-wise compared while preserving the original order.

Apparently, for user-defined operator class user_defined_ops that compares two text values via user defined function int_eq, YB cannot look into int_eq to see what it does and then automatically come up with a "clever" encoding that can be bytewise compared while keeping the original order. This is just not possible.

If docdb supports custom comparator, then I can imagine to support int_eq like how PG's works above:

PG performs the same steps to resolve int_eq and the result is stored in docdb sys catalog table as metadata. Once the custom comparator is initialized, it looks up sys catalog table to find out the PL/pgSQL function name 'int_eq'. After that, each time the custom comparator executes, it sends a request to PG which can dispatch to PL/pgSQL function int_eq to carry out the comparison. The comparison result is sent back to the docdb to determine how the keys (i.e., postgres data) are sorted. In this case there is no need for any "clever" encoding because int_eq works directly on the un-encoded postgres raw data. Compared to native postgres, this obviously is very costly: it needs at least inter-process communication because docdb is in the yb-tserver process and PG is in the postgres process. It may even involve network traffic if the docdb sits on a different node from the one that runs the postgres process.

A more realistic work around is for user to implement int_eq in C instead of PL/pgSQL. This makes it works like a builtin function. The user must rebuild YB executables from source code. Note that the user will do the same rebuild for native postgres if a user-defined int_eq does not meet the performance requirement and therefore needs to be reimplemented in C. The builtin functions can be grouped into a shared library that is linked into the docdb process. Such C implementation of int_eq will be executed directly on raw data, no "clever" encoding is needed. In addition, it is executed within the docdb process only and therefore will perform much faster, which is important because ultimately that is why an index is built for.

@myang2021
Copy link
Contributor

I was looking at "YB should also turn btree into lsm in the create operator class statement". It turned out that it is not accurate to say YB supports lsm instead of btree. YB does support btree index for temporary table. For example

yugabyte=# create temp table t (id text);
create temp table t (id text);
CREATE TABLE
yugabyte=# insert into t values ('a'), ('b'), ('c');
insert into t values ('a'), ('b'), ('c');
INSERT 0 3
yugabyte=# create index t_id_idx on t (id);
create index t_id_idx on t (id);
CREATE INDEX
yugabyte=# \d t;
              Table "pg_temp_1.t"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 id     | text |           |          |
Indexes:
    "t_id_idx" btree (id ASC)

Note that t_id_idx is a btree.

Next I made the following changes in testcase-1 and testcase-2:

create table t1 ...

to

create temp table t1 ...

and

create table t2 ...

to

create temp table t2 ...

Then both testcase-1 and testcase-2 worked correctly in YB without any error. Therefore, the earlier comment "YB should also turn btree into lsm in the create operator class statement" cannot be done because that will break both testcase-1 and testcase-2 when t1 and t2 are created as temporary tables.

In YB, temporary tables are not raft-replicated and only exist on one node. Once the session is over (DB connection is disconnected), they are dropped automatically. Temporary tables and their indexes have no traces in docdb at all and they are entirely managed by PG. That's why for temporary tables YB does not support lsm index and instead supports btree index. It is for regular tables YB needs to use docdb as backend storage. Because of the nature of docdb as discussed above, regular table btree index is automatically turned into lsm index.

@bllewell
Copy link
Contributor Author

bllewell commented Sep 8, 2021

testcase-2.zip

NOTE: I ran the tests described below using YB-2.9.0.0-b0.

In the light of @myang2021's analysis, I made a new testcase using only testcase-2 from the .zip that I already submitted. I changed the two "create table" statements in 2-cr-and-populate-tables-t1-and-t2.sql to use "create temporary table". Now, the testcase ran to completion in YB. But when I compared the equality-commutator-defined.txt spool files produced by YB and by PG, I saw that the row-ordering differed between the two. This is unsurprising because the queries had no order by clause. So I added these by editing 6-do-queries.sql. The changes to these two files are the only differences between the old and the new versions of testcase-2.

Briefly, when I used order by t1.v using <<, the test ran without error in both PG and YB. But still, the ordering had differences. This, too, is unsurprising because the data was contrived to have several pairs of rows with the same area but different values of rect by exchanging the h and w values. So I added a second criterion to order by k. This caused errors in YB (but not, of course, in PG). So I then tried ordering just by k — but this, too, caused errors in YB.

Here is an example:

ERROR:  XX000: Not found: Error loading table with oid 17702 in database with oid 16758: Table with identifier 00004176000030008000000000004526 not found: OBJECT_NOT_FOUND (master error 3)
LOCATION:  HandleYBStatusAtErrorLevel, pg_yb_utils.c:363

The same error comes several times.

Look for this in 6-do-queries.sql:

prepare two_rows_row_select as
select v from t1
where
  v >> (1,15)::rect and
  v << (1,21)::rect
   order by t1.v using <<        -- Alt. 1
-- order by t1.k                 -- Alt. 2
-- order by t1.v using <<, t1.k  -- Alt. 3
;

The identical order_by alternatives are found in prepare operator_based_join too. Look, too, at prepare operator_based_order. This has corresponding alternatives. But they are spelled differently because this test uses just one table.

As presented, the test runs without error in YB and in PG. But when Alt. 1 is commented out, and Alt. 2 is uncommented, this produces the error. It's the same when Alt. 3 is used.

You should run the tests by hand, with one terminal connected to PG and the other connected to YB side-by-side. Then you'll understand where, and how, the errors occur more effectively than using words here allows.

@myang2021
Copy link
Contributor

myang2021 commented Sep 9, 2021

Good finding. After debugging with the new test cases Alt. 2 and Alt. 3, I found that this is a bug that is just recently fixed by https://phabricator.dev.yugabyte.com/D12666#557e08ef, the relevant change is:

--- a/src/postgres/src/backend/access/index/indexam.c
+++ b/src/postgres/src/backend/access/index/indexam.c
@@ -640,7 +640,7 @@ index_fetch_heap(IndexScanDesc scan)
         * - If YugaByte returns an index-tuple, the returned ybctid value should be used to query data.
         * - If YugaByte returns a heap_tuple, all requested data was already selected in the tuple.
         */
-       if (IsYugaByteEnabled())
+       if (IsYBRelation(scan->heapRelation))
        {
                if (scan->xs_hitup != 0)
                        return scan->xs_hitup;

Note that earlier we just say as long as IsYugaByteEnabled we will run YB code which will go to yb-master to lookup for the table and will fail because for temp table yb-master has nothing for it. Now we change that condition to test for IsYBRelation, for temp table, it will return false since temp table is not a YB relation (a relation that yb-master is aware of). So your new test cases will work in the latest master branch. I have verified that by running the new test cases against latest master build and I don't see any ERROR reported. When I ran against a 2.9.0 build, I do see the ERROR as described above.

@bllewell
Copy link
Contributor Author

bllewell commented Sep 9, 2021

Good news. Please update this issue's acount with the Version (e.g. YB-2.9.0.1-b0 or similar) when the version that has the fix is available for download from our Internet-facing Install YugabyteDB—macOS page and its cousins.

@saurik
Copy link

saurik commented Jan 31, 2022

@bllewell FWIW, having run into the inability to index on user-defined types today (in my case I first ran into it trying to make a primary key with a user-defined type that I had just coded up in C), I'll say that I'd be totally fine providing an implementation of a new operator class / function that returns a lexicographically-comparable buffer to use in the RocksDB key for DocDB.

@myang2021
Copy link
Contributor

@saurik , do you mean you already have such an implementation that you want to be included into the yugabyte main code base?

@saurik
Copy link

saurik commented Feb 1, 2022

@myang2021 By "providing an implementation" I mean as the app developer. Like, what I'm seeing you so far talking about in this issue for accomplishing this goal is build out support for the btree operator class down through DocDB into RocksDB to support custom comparators on keys... but I totally appreciate why that is complex and I can see a ton of problems that would cause and I'm not really expecting that to happen any time soon? The nice thing about that is that it will just work "out of the box" for a lot of existing types"; but, as someone who has been using PostgreSQL for about 20 years now and is starting to work with YugabyteDB, I'm kind of used to having to implement random operator classes to support new index types, and am thereby throwing out the idea that a much simpler approach--one that would work for now and should be easy to continue to be compatible later if you make it have higher priority or flag which was used to build an index--would be sufficient for me last night: you (YugabyteDB) make a new operator class with a function that we (the app developers) can implement that takes our user-defined type and returns a lexicographically-orderable bytea. (Given the error message I got I had actually expected that that would already be how it worked under the covers, but glancing through the YugabyteDB repository I failed to find such.)

@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 8, 2022
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Aug 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature pgcm priority/medium Medium priority issue
Projects
Status: No status
Development

No branches or pull requests

6 participants