Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Include system indexes in read transactions #21820

Merged
merged 14 commits into from
Sep 21, 2023

Conversation

jkosh44
Copy link
Contributor

@jkosh44 jkosh44 commented Sep 19, 2023

cf1271c claimed to have already completed this, however, due to an unfortunate git typo the original contents of that commit were replaced with a small refactor. This commit applies the original contents of
cf1271c. The original commit message is copied below.

System views from pg_catalog and mz_catalog are sometimes used by applications (for example Metabase if it doesn't recognize a data type) after the first query of a transaction. This previously would cause a timedomain error. One solution to this is to always include system indexes in read transactions. The cost is mostly that system indexes can now be held back from compaction more than previously. This seems ok because they don't change quickly, so should hopefully not cause memory issues.

Fixes MaterializeInc/database-issues#2868
Fixes MaterializeInc/database-issues#2867
Fixes MaterializeInc/database-issues#6542

Motivation

This PR adds a known-desirable feature.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • Allow pg_catalog objects in all read transactions.

cf1271c claimed to have already
completed this, however, due to an unfortunate git typo the original
contents of that commit were replaced with a small refactor. This
commit applies the original contents of
cf1271c. The original commit message
is copied below.

System views from pg_catalog and mz_catalog are sometimes used by
applications (for example Metabase if it doesn't recognize a data type)
after the first query of a transaction. This previously would cause a
timedomain error. One solution to this is to always include system indexes
in read transactions. The cost is mostly that system indexes can now be
held back from compaction more than previously. This seems ok because
they don't change quickly, so should hopefully not cause memory issues.

Fixes #9375
Fixes #9374
Fixes #21815

Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
@jkosh44 jkosh44 marked this pull request as ready for review September 19, 2023 19:38
@jkosh44 jkosh44 requested a review from a team as a code owner September 19, 2023 19:38
Comment on lines 845 to 847
let system_id_bundle = self
.index_oracle(*self.catalog().get_mz_introspections_cluster_id())
.sufficient_collections(item_ids.iter());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if item_ids only contain items from the Postgres catalog schemas, sufficient_collections will still return indexes in the mz_catalog schema. So I just decided to always include all catalog schemas.

@@ -39,3 +39,138 @@ DETAIL: The following relations in the query are outside the transaction's time
"materialize.public.i2"
Only the following relations are available:
"materialize.public.i1"
"materialize.public.t1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A tad bit worried here that use case isolation is being violated here. Queries by users will start holding back system indexes and increasing memory use. Also having system and user id'd indexes seems not great, as that will balloon as users add many objects across many clusters. One of the arguments for only including pg_catalog is that it would limit this list of included indexes to a known set that doesn't change based on user created objects.

Do we have data on how expensive it is to add all these read holds? Is it mostly just adding to a btreemap so it's very cheap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't take a close look at this change, but I'm actually a bit baffled by this. Why wasn't materialize.public.t1 included in the timedomain previously? It's in the same schema as materialize.public.i1. Also, why did my changes cause it to be included? My changes only added system schemas, it didn't touch user schemas. I'll have to look into this.

Anyway, to answer your questions directly:

Do we have data on how expensive it is to add all these read holds?

I don't have any data on this, but it's expensiveness would depend on how long the transaction is held open.

Is it mostly just adding to a btreemap so it's very cheap?

My feeling is that it's rather expensive on the system indexes since compaction is stopped and they will just accumulate in-memory in the clusters.

Also having system and user id'd indexes seems not great, as that will balloon as users add many objects across many clusters.

We only include indexes in the active cluster and the mz_introspection cluster. So adding indexes across many clusters won't cause the read holds of a single transaction to balloon to include all of those indexes.

I think we have to at least sometimes mix user and system indexes. A query like SELECT * FROM t, mz_views already includes system indexes (the introspection sources of the current cluster) and user indexes (any indexes in the active cluster). This change just makes that the default behavior.

One of the arguments for only including pg_catalog is that it would limit this list of included indexes to a known set that doesn't change based on user created objects.

I'm not sure I'm following this. Are you suggesting to exclude user indexes from the timedomain? That would prevent users from using user indexes in a transaction which doesn't seem ideal. Or are you suggesting to exclude the introspection sources? Either way, I think you're right about only including pg_catalog by default. We can re-visit later if the other schemas are needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wasn't materialize.public.t1 included in the timedomain previously? It's in the same schema as materialize.public.i1. Also, why did my changes cause it to be included?

Ok I figured this out. A relation will be excluded from the timedomain, if there is an index on that relation in the timedomain.

src/adapter/src/coord/timeline.rs Outdated Show resolved Hide resolved
@@ -840,6 +840,12 @@ impl Coordinator {
let mut id_bundle: CollectionIdBundle = self
.index_oracle(compute_instance)
.sufficient_collections(item_ids.iter());
// Queries may get auto-routed to the mz_introspection cluster, so we include indexes on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if they aren't auto routed. Does this over guess and do something wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some weird interactions with timedomains and clusters, I just started a discussion here: https://materializeinc.slack.com/archives/C02FWJ94HME/p1695214693135449

This change makes it so that indexes in BOTH the active cluster and the mz_introspection cluster are included in the timedomain. Previously, only indexes in the active cluster were included in the timedomain. Without this change, the issue is that if you're query is auto-routed to mz_introspection and hits an index, then it will be aborted because that index is outside of the timedomain.

What happens if they aren't auto routed. Does this over guess and do something wrong?

If they aren't auto-routed, then we have the same behavior as before this change. The query will use the current cluster, which also has it's indexes in the timedomain (if we haven't switched clusters). The read holds on the indexes in mz_introspection end up being wasteful because we don't use them, but we don't do anything wrong.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Sep 20, 2023

@mjibson This is ready for another review. I restricted the catalog schemas to only include pg_catalog by default, which contains no indexes. Additionally I removed the code that always adds indexes from mz_introspection.

schemas.extend(system_schemas);
} else if !schemas.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having both of these if blocks. Feels like the right tradeoff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@@ -144,6 +144,11 @@ pub fn auto_run_on_introspection<'a, 's, 'p>(
return TargetCluster::Active;
}

// Auto-routing mid transaction can cause an abort due to an invalid timedomain.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does one of the added tests cover this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added one. I meant to comment that some tests were failing without this because a bunch of pg_catalog queries get auto-routed and then fail mid-transaction.

@jkosh44 jkosh44 enabled auto-merge (squash) September 20, 2023 19:16
@jkosh44
Copy link
Contributor Author

jkosh44 commented Sep 20, 2023

The current test error is also due to auto-routing. Here's a rough sketch of what's going on:

  1. BEGIN
  2. Query system tables.
  3. Query is auto-routed to mz_introspection.
  4. Time domain contains indexes from mz_introspection but not the underlying tables because of the indexes.
  5. Query system tables with a function.
  6. Query is not auto-routed to mz_introspection because it contains a function.
  7. Current cluster has no indexes so we go to the underlying tables.
  8. Tables are not in the timedomain so we fail.

To fix this, we need to set the cluster at the start of the transaction and not allow anyone to change it.

@maddyblue
Copy link
Contributor

To fix this, we need to set the cluster at the start of the transaction and not allow anyone to change it.

Encouragement to do all that in another PR.

This commit updates the transaction logic so the entire transaction is
executed on a single cluster. Users cannot change clusters
mid-transaction and auto-routing is disabled mid-transaction.

The first query in a transaction determines the time domain of the
transaction. The timedomain only contains indexes on the active cluster
of the first query. Switching clusters will almost always lead to
timedomain errors. Additionally, some serving layer work will be moved
to clusters in the future. Switching cluster mid-transactions will
complicate that work.

Including multiple cluster indexes in a single transaction's timedomain
has negative use-case isolation implications because we need to take
read holds on all indexes.

Resolves #21839
@jkosh44
Copy link
Contributor Author

jkosh44 commented Sep 21, 2023

To fix this, we need to set the cluster at the start of the transaction and not allow anyone to change it.

Encouragement to do all that in another PR.

#21854

@shepherdlybot
Copy link

shepherdlybot bot commented Sep 21, 2023

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag. What's This?

Risk Score Probability Buggy File Hotspots
🔴 81 / 100 61% 7
Buggy File Hotspots:
File Percentile
../src/session.rs 98
../src/coord.rs 100
../src/catalog.rs 99
../coord/introspection.rs 97
../sequencer/inner.rs 99
../coord/timeline.rs 95
../src/error.rs 96

@jkosh44
Copy link
Contributor Author

jkosh44 commented Sep 21, 2023

I pulled in #21854 which is a blocker for this PR. I'll wait for that to merge before merging this PR.

Comment on lines +904 to +910
# Since mz_views uses custom types, the postgres client will look it up in the catalog on
# first use. If the first use happens to be in a transaction, then we can get unexpected time
# domain errors. This is an annoying hack to load the information in the postgres client before
# we start any transactions.
statement ok
SELECT * FROM mz_views LIMIT 0
----
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still required with this change due to https://github.com/MaterializeInc/materialize/issues/21815#issuecomment-1729712813, which is very unfortunate (CC @mjibson).

@jkosh44 jkosh44 enabled auto-merge (squash) September 21, 2023 16:08
@jkosh44 jkosh44 merged commit e51257e into MaterializeInc:main Sep 21, 2023
@jkosh44 jkosh44 deleted the system-index-txns branch September 21, 2023 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants