[YSQL] Insert and Alter table race condition can cause not null constraint violation #12106

myang2021 · 2022-04-11T18:36:16Z

Jira Link: DB-1254

Description

The following sequence describes how a not null constraint on a YSQL table can be violated:

1. PG executes an alter table statement, which is translated into a number of updates to PG sys catalog tables. Because alter table is a DDL statement, a separate DDL transaction manages the entire flow of the statement, including all updates to the relevant PG sys catalog tables.
2. After the PG sys catalog updates succeed, PG executes an additional YB alter table operation (YBCExecAlterTable). At this point, all the PG sys catalog updates reside in the intents db and can be rolled back if the DDL transaction is later aborted.
3. An AlterTable RPC is sent from PG to the TServer that the PG is bound to (its local TServer). For AlterTable, the TServer forwards the request to the Master.
4. The Master writes the new table schema metadata into the YB sys catalog table. This is different from the PG sys catalog table, although both are raft-replicated. The YB sys catalog table does not have an intents db, so the alter table operation is written directly into the regular db of the YB sys catalog table to indicate that the alter table is ongoing. The Master then sends one AlterSchema RPC to each of the tablet replicas hosting the table being altered, because each tablet hosting table T also stores T's schema metadata, which is now stale and needs to be updated.
5. The local TServer keeps polling the Master via the IsAlterTableDone RPC to check whether the alter table is done. The Master waits for a response to each AlterSchema RPC. On each TServer-to-Master heartbeat, the TServer reports its schema version to the Master. When the Master finds that all tablets have the latest table schema version, it marks the AlterTable as done by writing the "finalized" alter table operation directly into the regular db of the YB sys catalog table. This has two effects: (1) the next IsAlterTableDone RPC gets a positive answer, and the local TServer replies to PG that its AlterTable RPC has succeeded; (2) the next GetTableSchema RPC returns the new schema version 1.
6. PG invalidates its old YB table entry and reloads T to pick up the new schema. It also issues a request to increment the master catalog version in table pg_yb_catalog_version. This increment is covered by the same DDL transaction.
7. PG commits the DDL transaction, which covers the entire alter table flow described above.
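
The visibility gap implied by the flow above can be sketched with a toy model. This is only an illustration under simplifying assumptions, not YugabyteDB code: the names `VisibleSchema`, `visible_after`, and `inconsistent_steps` are hypothetical, schema versions are reduced to 0 (old) and 1 (new), and the YB side is treated as switching at the moment the Master finalizes the alter.

```python
from dataclasses import dataclass


@dataclass
class VisibleSchema:
    """Schema versions a concurrent session can observe (toy model)."""
    pg_catalog_version: int  # PG sys catalog: changes stay in intents until commit
    yb_catalog_version: int  # YB sys catalog: written straight to the regular db


def visible_after(step: int) -> VisibleSchema:
    """What another session sees once the given step (1-7) has completed."""
    # Step 5: the Master finalizes the alter in the YB sys catalog; visible at once.
    yb_version = 1 if step >= 5 else 0
    # Step 7: the DDL transaction commits; only now do PG catalog updates show.
    pg_version = 1 if step >= 7 else 0
    return VisibleSchema(pg_version, yb_version)


def inconsistent_steps() -> list:
    """Steps after which the two catalogs disagree for a concurrent reader."""
    result = []
    for step in range(1, 8):
        seen = visible_after(step)
        if seen.pg_catalog_version != seen.yb_catalog_version:
            result.append(step)
    return result


if __name__ == "__main__":
    print(inconsistent_steps())  # [5, 6]
```

In this model the two catalogs disagree after steps 5 and 6, which is the window between the YB-side finalization and the DDL commit.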

In case of a race between two sessions:

insert into T values (1)

and

alter table T add column v1 int not null

The two statements cannot both succeed, because the inserted row would have no value for v1 and would therefore violate the not null constraint.

If the insert reads the PG sys catalog table before step 7, it reads the old PG schema metadata, which does not have the not null constraint. However, if the insert reads the YB sys catalog after step 5, it reads the new YB table schema metadata.

Suppose the insert fails with a "schema version mismatch" error. This does not mean that step 7 is done. It only means that the TServer replica the insert reached has already received the new schema. When the insert is retried, we may still be in the window between step 5 and step 7. In that case the insert can read the old PG schema metadata, if the master catalog version increment of step 6 has not completed yet, together with the new YB schema metadata, because step 5 is done. As a result, PG does not see the not null constraint in the old PG schema, the new YB schema metadata no longer triggers the "schema version mismatch" error, and the insert succeeds on the retry. The end result is that both the insert and the alter succeed, violating the not null constraint.
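
The retry scenario can be replayed in a small simulation. This is a hedged sketch rather than a reproduction against a real cluster: `Cluster`, `insert_once`, `insert_with_retry`, and `run_race` are hypothetical names, the key column is called `k` for illustration only, and the model assumes that PG checks the not null constraint against its own catalog view while the storage layer checks only the schema version.

```python
class SchemaVersionMismatch(Exception):
    """The session's cached YB schema version differs from the tablet's."""


class NotNullViolation(Exception):
    """PG-side constraint check failed."""


OLD_PG_SCHEMA = {"columns": ["k"], "not_null": []}
NEW_PG_SCHEMA = {"columns": ["k", "v1"], "not_null": ["v1"]}


class Cluster:
    def __init__(self):
        self.ddl_committed = False   # becomes True at step 7
        self.yb_schema_version = 0   # becomes 1 at step 5
        self.rows = []

    def pg_schema(self):
        # Other sessions see the new PG catalog only after the DDL commit.
        return NEW_PG_SCHEMA if self.ddl_committed else OLD_PG_SCHEMA


def insert_once(cluster, row, cached_yb_version):
    # Constraint check uses whatever PG schema is currently visible.
    for column in cluster.pg_schema()["not_null"]:
        if row.get(column) is None:
            raise NotNullViolation(column)
    # Storage-side check compares schema versions only.
    if cached_yb_version != cluster.yb_schema_version:
        raise SchemaVersionMismatch()
    cluster.rows.append(row)


def insert_with_retry(cluster, row, cached_yb_version):
    attempts = 0
    while True:
        attempts += 1
        try:
            insert_once(cluster, row, cached_yb_version)
            return attempts
        except SchemaVersionMismatch:
            # Reload the YB table entry; the PG schema view is unchanged.
            cached_yb_version = cluster.yb_schema_version


def run_race():
    cluster = Cluster()
    cluster.yb_schema_version = 1          # step 5 done, step 7 not yet
    attempts = insert_with_retry(cluster, {"k": 1}, cached_yb_version=0)
    cluster.ddl_committed = True           # step 7: the alter also succeeds
    violating = [r for r in cluster.rows if r.get("v1") is None]
    return attempts, violating
```

Calling `run_race()` in this model returns two attempts and one stored row without a `v1` value: the first attempt hits the version mismatch, and the retry passes both checks because the constraint is not yet visible on the PG side.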

myang2021 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Apr 11, 2022

myang2021 self-assigned this Apr 11, 2022

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 8, 2022

tverona1 mentioned this issue Jun 15, 2022

Online schema migrations #4192

Open

yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jun 29, 2022
