sql: Include system indexes in read transactions #21820

jkosh44 · 2023-09-19T18:58:49Z

cf1271c claimed to have already completed this, however, due to an unfortunate git typo the original contents of that commit were replaced with a small refactor. This commit applies the original contents of
cf1271c. The original commit message is copied below.

System views from pg_catalog and mz_catalog are sometimes used by applications (for example Metabase if it doesn't recognize a data type) after the first query of a transaction. This previously would cause a timedomain error. One solution to this is to always include system indexes in read transactions. The cost is mostly that system indexes can now be held back from compaction more than previously. This seems ok because they don't change quickly, so should hopefully not cause memory issues.

Fixes MaterializeInc/database-issues#2868
Fixes MaterializeInc/database-issues#2867
Fixes MaterializeInc/database-issues#6542

Motivation

This PR adds a known-desirable feature.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- Allow pg_catalog objects in all read transactions.

cf1271c claimed to have already completed this, however, due to an unfortunate git typo the original contents of that commit were replaced with a small refactor. This commit applies the original contents of cf1271c. The original commit message is copied below. System views from pg_catalog and mz_catalog are sometimes used by applications (for example Metabase if it doesn't recognize a data type) after the first query of a transaction. This previously would cause a timedomain error. One solution to this is to always include system indexes in read transactions. The cost is mostly that system indexes can now be held back from compaction more than previously. This seems ok because they don't change quickly, so should hopefully not cause memory issues. Fixes #9375 Fixes #9374 Fixes #21815 Co-authored-by: Matt Jibson <matt.jibson@gmail.com>

jkosh44 · 2023-09-19T19:41:00Z

src/adapter/src/coord/timeline.rs

+        let system_id_bundle = self
+            .index_oracle(*self.catalog().get_mz_introspections_cluster_id())
+            .sufficient_collections(item_ids.iter());


Even if item_ids only contain items from the Postgres catalog schemas, sufficient_collections will still return indexes in the mz_catalog schema. So I just decided to always include all catalog schemas.

maddyblue · 2023-09-19T20:10:41Z

test/sqllogictest/github-8241.slt

@@ -39,3 +39,138 @@ DETAIL: The following relations in the query are outside the transaction's time
 "materialize.public.i2"
 Only the following relations are available:
 "materialize.public.i1"
+"materialize.public.t1"


A tad bit worried here that use case isolation is being violated here. Queries by users will start holding back system indexes and increasing memory use. Also having system and user id'd indexes seems not great, as that will balloon as users add many objects across many clusters. One of the arguments for only including pg_catalog is that it would limit this list of included indexes to a known set that doesn't change based on user created objects.

Do we have data on how expensive it is to add all these read holds? Is it mostly just adding to a btreemap so it's very cheap?

I didn't take a close look at this change, but I'm actually a bit baffled by this. Why wasn't materialize.public.t1 included in the timedomain previously? It's in the same schema as materialize.public.i1. Also, why did my changes cause it to be included? My changes only added system schemas, it didn't touch user schemas. I'll have to look into this.

Anyway, to answer your questions directly:

Do we have data on how expensive it is to add all these read holds?

I don't have any data on this, but it's expensiveness would depend on how long the transaction is held open.

Is it mostly just adding to a btreemap so it's very cheap?

My feeling is that it's rather expensive on the system indexes since compaction is stopped and they will just accumulate in-memory in the clusters.

Also having system and user id'd indexes seems not great, as that will balloon as users add many objects across many clusters.

We only include indexes in the active cluster and the mz_introspection cluster. So adding indexes across many clusters won't cause the read holds of a single transaction to balloon to include all of those indexes.

I think we have to at least sometimes mix user and system indexes. A query like SELECT * FROM t, mz_views already includes system indexes (the introspection sources of the current cluster) and user indexes (any indexes in the active cluster). This change just makes that the default behavior.

One of the arguments for only including pg_catalog is that it would limit this list of included indexes to a known set that doesn't change based on user created objects.

I'm not sure I'm following this. Are you suggesting to exclude user indexes from the timedomain? That would prevent users from using user indexes in a transaction which doesn't seem ideal. Or are you suggesting to exclude the introspection sources? Either way, I think you're right about only including pg_catalog by default. We can re-visit later if the other schemas are needed.

Why wasn't materialize.public.t1 included in the timedomain previously? It's in the same schema as materialize.public.i1. Also, why did my changes cause it to be included?

Ok I figured this out. A relation will be excluded from the timedomain, if there is an index on that relation in the timedomain.

src/adapter/src/coord/timeline.rs

maddyblue · 2023-09-20T00:23:30Z

src/adapter/src/coord/timeline.rs

@@ -840,6 +840,12 @@ impl Coordinator {
        let mut id_bundle: CollectionIdBundle = self
            .index_oracle(compute_instance)
            .sufficient_collections(item_ids.iter());
+        // Queries may get auto-routed to the mz_introspection cluster, so we include indexes on


What happens if they aren't auto routed. Does this over guess and do something wrong?

There's some weird interactions with timedomains and clusters, I just started a discussion here: https://materializeinc.slack.com/archives/C02FWJ94HME/p1695214693135449

This change makes it so that indexes in BOTH the active cluster and the mz_introspection cluster are included in the timedomain. Previously, only indexes in the active cluster were included in the timedomain. Without this change, the issue is that if you're query is auto-routed to mz_introspection and hits an index, then it will be aborted because that index is outside of the timedomain.

What happens if they aren't auto routed. Does this over guess and do something wrong?

If they aren't auto-routed, then we have the same behavior as before this change. The query will use the current cluster, which also has it's indexes in the timedomain (if we haven't switched clusters). The read holds on the indexes in mz_introspection end up being wasteful because we don't use them, but we don't do anything wrong.

jkosh44 · 2023-09-20T15:44:28Z

@mjibson This is ready for another review. I restricted the catalog schemas to only include pg_catalog by default, which contains no indexes. Additionally I removed the code that always adds indexes from mz_introspection.

maddyblue · 2023-09-20T15:52:47Z

src/adapter/src/coord/timeline.rs

            schemas.extend(system_schemas);
+        } else if !schemas.is_empty() {


I like having both of these if blocks. Feels like the right tradeoff.

maddyblue · 2023-09-20T17:29:54Z

src/adapter/src/coord/introspection.rs

@@ -144,6 +144,11 @@ pub fn auto_run_on_introspection<'a, 's, 'p>(
        return TargetCluster::Active;
    }

+    // Auto-routing mid transaction can cause an abort due to an invalid timedomain.


Does one of the added tests cover this?

Just added one. I meant to comment that some tests were failing without this because a bunch of pg_catalog queries get auto-routed and then fail mid-transaction.

jkosh44 · 2023-09-20T21:27:59Z

The current test error is also due to auto-routing. Here's a rough sketch of what's going on:

BEGIN
Query system tables.
Query is auto-routed to mz_introspection.
Time domain contains indexes from mz_introspection but not the underlying tables because of the indexes.
Query system tables with a function.
Query is not auto-routed to mz_introspection because it contains a function.
Current cluster has no indexes so we go to the underlying tables.
Tables are not in the timedomain so we fail.

To fix this, we need to set the cluster at the start of the transaction and not allow anyone to change it.

maddyblue · 2023-09-20T21:31:38Z

To fix this, we need to set the cluster at the start of the transaction and not allow anyone to change it.

Encouragement to do all that in another PR.

This commit updates the transaction logic so the entire transaction is executed on a single cluster. Users cannot change clusters mid-transaction and auto-routing is disabled mid-transaction. The first query in a transaction determines the time domain of the transaction. The timedomain only contains indexes on the active cluster of the first query. Switching clusters will almost always lead to timedomain errors. Additionally, some serving layer work will be moved to clusters in the future. Switching cluster mid-transactions will complicate that work. Including multiple cluster indexes in a single transaction's timedomain has negative use-case isolation implications because we need to take read holds on all indexes. Resolves #21839

jkosh44 · 2023-09-21T13:38:21Z

To fix this, we need to set the cluster at the start of the transaction and not allow anyone to change it.

Encouragement to do all that in another PR.

#21854

…to system-index-txns

shepherdlybot · 2023-09-21T14:39:20Z

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag. What's This?

Risk Score	Probability	Buggy File Hotspots
🔴 81 / 100	61%	7

Buggy File Hotspots:

File	Percentile
../src/session.rs	98
../src/coord.rs	100
../src/catalog.rs	99
../coord/introspection.rs	97
../sequencer/inner.rs	99
../coord/timeline.rs	95
../src/error.rs	96

jkosh44 · 2023-09-21T14:40:13Z

I pulled in #21854 which is a blocker for this PR. I'll wait for that to merge before merging this PR.

jkosh44 · 2023-09-21T14:41:29Z

test/sqllogictest/transactions.slt

+# Since mz_views uses custom types, the postgres client will look it up in the catalog on
+# first use. If the first use happens to be in a transaction, then we can get unexpected time
+# domain errors. This is an annoying hack to load the information in the postgres client before
+# we start any transactions.
+statement ok
+SELECT * FROM mz_views LIMIT 0
+----


This is still required with this change due to https://github.com/MaterializeInc/materialize/issues/21815#issuecomment-1729712813, which is very unfortunate (CC @mjibson).

jkosh44 force-pushed the system-index-txns branch from 9c16035 to db971b6 Compare September 19, 2023 19:38

jkosh44 marked this pull request as ready for review September 19, 2023 19:38

jkosh44 requested a review from a team as a code owner September 19, 2023 19:38

jkosh44 requested review from ParkMyCar and maddyblue September 19, 2023 19:38

jkosh44 commented Sep 19, 2023

View reviewed changes

maddyblue reviewed Sep 20, 2023

View reviewed changes

jkosh44 added 3 commits September 20, 2023 09:41

Use COWs

44db71e

Restrict to just the pg_catalog schema

ffe9bce

Fix clippy

ee0e232

jkosh44 requested a review from maddyblue September 20, 2023 15:43

maddyblue approved these changes Sep 20, 2023

View reviewed changes

jkosh44 added 2 commits September 20, 2023 13:28

Dont auto-route mid transaction

024b54d

Fix test

eebfba8

maddyblue reviewed Sep 20, 2023

View reviewed changes

jkosh44 added 2 commits September 20, 2023 13:45

Add auto-route test

28d04f9

Fix auto-routing

42f4145

jkosh44 enabled auto-merge (squash) September 20, 2023 19:16

jkosh44 added 2 commits September 20, 2023 19:28

Fix test

bbb956e

jkosh44 added 4 commits September 21, 2023 09:45

Fix test

a854900

Fix test

51381b7

Merge branch 'timedomain-sanity' of github.com:jkosh44/materialize in…

3240c0b

…to system-index-txns

Fix merge skew

7c443fb

jkosh44 commented Sep 21, 2023

View reviewed changes

jkosh44 disabled auto-merge September 21, 2023 15:00

jkosh44 enabled auto-merge (squash) September 21, 2023 16:08

jkosh44 merged commit e51257e into MaterializeInc:main Sep 21, 2023

jkosh44 deleted the system-index-txns branch September 21, 2023 16:53

		schemas.extend(system_schemas);
		} else if !schemas.is_empty() {

sql: Include system indexes in read transactions #21820

sql: Include system indexes in read transactions #21820

Uh oh!

Conversation

jkosh44 commented Sep 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkosh44 commented Sep 20, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkosh44 commented Sep 20, 2023

Uh oh!

maddyblue commented Sep 20, 2023

Uh oh!

jkosh44 commented Sep 21, 2023

Uh oh!

shepherdlybot bot commented Sep 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkosh44 commented Sep 21, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jkosh44 commented Sep 19, 2023 •

edited

Loading

shepherdlybot bot commented Sep 21, 2023 •

edited

Loading