-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: Include system indexes in read transactions #21820
Conversation
cf1271c claimed to have already completed this, however, due to an unfortunate git typo the original contents of that commit were replaced with a small refactor. This commit applies the original contents of cf1271c. The original commit message is copied below. System views from pg_catalog and mz_catalog are sometimes used by applications (for example Metabase if it doesn't recognize a data type) after the first query of a transaction. This previously would cause a timedomain error. One solution to this is to always include system indexes in read transactions. The cost is mostly that system indexes can now be held back from compaction more than previously. This seems ok because they don't change quickly, so should hopefully not cause memory issues. Fixes #9375 Fixes #9374 Fixes #21815 Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
9c16035
to
db971b6
Compare
src/adapter/src/coord/timeline.rs
Outdated
let system_id_bundle = self | ||
.index_oracle(*self.catalog().get_mz_introspections_cluster_id()) | ||
.sufficient_collections(item_ids.iter()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if item_ids
only contain items from the Postgres catalog schemas, sufficient_collections
will still return indexes in the mz_catalog
schema. So I just decided to always include all catalog schemas.
test/sqllogictest/github-8241.slt
Outdated
@@ -39,3 +39,138 @@ DETAIL: The following relations in the query are outside the transaction's time | |||
"materialize.public.i2" | |||
Only the following relations are available: | |||
"materialize.public.i1" | |||
"materialize.public.t1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A tad bit worried here that use case isolation is being violated here. Queries by users will start holding back system indexes and increasing memory use. Also having system and user id'd indexes seems not great, as that will balloon as users add many objects across many clusters. One of the arguments for only including pg_catalog is that it would limit this list of included indexes to a known set that doesn't change based on user created objects.
Do we have data on how expensive it is to add all these read holds? Is it mostly just adding to a btreemap so it's very cheap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't take a close look at this change, but I'm actually a bit baffled by this. Why wasn't materialize.public.t1
included in the timedomain previously? It's in the same schema as materialize.public.i1
. Also, why did my changes cause it to be included? My changes only added system schemas, it didn't touch user schemas. I'll have to look into this.
Anyway, to answer your questions directly:
Do we have data on how expensive it is to add all these read holds?
I don't have any data on this, but it's expensiveness would depend on how long the transaction is held open.
Is it mostly just adding to a btreemap so it's very cheap?
My feeling is that it's rather expensive on the system indexes since compaction is stopped and they will just accumulate in-memory in the clusters.
Also having system and user id'd indexes seems not great, as that will balloon as users add many objects across many clusters.
We only include indexes in the active cluster and the mz_introspection
cluster. So adding indexes across many clusters won't cause the read holds of a single transaction to balloon to include all of those indexes.
I think we have to at least sometimes mix user and system indexes. A query like SELECT * FROM t, mz_views
already includes system indexes (the introspection sources of the current cluster) and user indexes (any indexes in the active cluster). This change just makes that the default behavior.
One of the arguments for only including pg_catalog is that it would limit this list of included indexes to a known set that doesn't change based on user created objects.
I'm not sure I'm following this. Are you suggesting to exclude user indexes from the timedomain? That would prevent users from using user indexes in a transaction which doesn't seem ideal. Or are you suggesting to exclude the introspection sources? Either way, I think you're right about only including pg_catalog
by default. We can re-visit later if the other schemas are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wasn't materialize.public.t1 included in the timedomain previously? It's in the same schema as materialize.public.i1. Also, why did my changes cause it to be included?
Ok I figured this out. A relation will be excluded from the timedomain, if there is an index on that relation in the timedomain.
src/adapter/src/coord/timeline.rs
Outdated
@@ -840,6 +840,12 @@ impl Coordinator { | |||
let mut id_bundle: CollectionIdBundle = self | |||
.index_oracle(compute_instance) | |||
.sufficient_collections(item_ids.iter()); | |||
// Queries may get auto-routed to the mz_introspection cluster, so we include indexes on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if they aren't auto routed. Does this over guess and do something wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some weird interactions with timedomains and clusters, I just started a discussion here: https://materializeinc.slack.com/archives/C02FWJ94HME/p1695214693135449
This change makes it so that indexes in BOTH the active cluster and the mz_introspection
cluster are included in the timedomain. Previously, only indexes in the active cluster were included in the timedomain. Without this change, the issue is that if you're query is auto-routed to mz_introspection
and hits an index, then it will be aborted because that index is outside of the timedomain.
What happens if they aren't auto routed. Does this over guess and do something wrong?
If they aren't auto-routed, then we have the same behavior as before this change. The query will use the current cluster, which also has it's indexes in the timedomain (if we haven't switched clusters). The read holds on the indexes in mz_introspection
end up being wasteful because we don't use them, but we don't do anything wrong.
@mjibson This is ready for another review. I restricted the catalog schemas to only include |
schemas.extend(system_schemas); | ||
} else if !schemas.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having both of these if blocks. Feels like the right tradeoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
@@ -144,6 +144,11 @@ pub fn auto_run_on_introspection<'a, 's, 'p>( | |||
return TargetCluster::Active; | |||
} | |||
|
|||
// Auto-routing mid transaction can cause an abort due to an invalid timedomain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does one of the added tests cover this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added one. I meant to comment that some tests were failing without this because a bunch of pg_catalog queries get auto-routed and then fail mid-transaction.
The current test error is also due to auto-routing. Here's a rough sketch of what's going on:
To fix this, we need to set the cluster at the start of the transaction and not allow anyone to change it. |
Encouragement to do all that in another PR. |
This commit updates the transaction logic so the entire transaction is executed on a single cluster. Users cannot change clusters mid-transaction and auto-routing is disabled mid-transaction. The first query in a transaction determines the time domain of the transaction. The timedomain only contains indexes on the active cluster of the first query. Switching clusters will almost always lead to timedomain errors. Additionally, some serving layer work will be moved to clusters in the future. Switching cluster mid-transactions will complicate that work. Including multiple cluster indexes in a single transaction's timedomain has negative use-case isolation implications because we need to take read holds on all indexes. Resolves #21839
|
This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag. What's This?
Buggy File Hotspots:
|
I pulled in #21854 which is a blocker for this PR. I'll wait for that to merge before merging this PR. |
# Since mz_views uses custom types, the postgres client will look it up in the catalog on | ||
# first use. If the first use happens to be in a transaction, then we can get unexpected time | ||
# domain errors. This is an annoying hack to load the information in the postgres client before | ||
# we start any transactions. | ||
statement ok | ||
SELECT * FROM mz_views LIMIT 0 | ||
---- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still required with this change due to https://github.com/MaterializeInc/materialize/issues/21815#issuecomment-1729712813, which is very unfortunate (CC @mjibson).
cf1271c claimed to have already completed this, however, due to an unfortunate git typo the original contents of that commit were replaced with a small refactor. This commit applies the original contents of
cf1271c. The original commit message is copied below.
System views from pg_catalog and mz_catalog are sometimes used by applications (for example Metabase if it doesn't recognize a data type) after the first query of a transaction. This previously would cause a timedomain error. One solution to this is to always include system indexes in read transactions. The cost is mostly that system indexes can now be held back from compaction more than previously. This seems ok because they don't change quickly, so should hopefully not cause memory issues.
Fixes MaterializeInc/database-issues#2868
Fixes MaterializeInc/database-issues#2867
Fixes MaterializeInc/database-issues#6542
Motivation
This PR adds a known-desirable feature.
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.