
storage/pg: introduce pg_snapshot_statement_timeout LD flag #21150

Merged · 5 commits · Sep 19, 2023

Conversation

@sploiselle (Contributor) commented Aug 11, 2023

Motivation

This PR adds a known-desirable feature. MaterializeInc/database-issues#4657

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@sploiselle sploiselle changed the title Pg statement timeout storage/pg: introduce pg_snapshot_statement_timeout LD flag Aug 11, 2023
@benesch (Member) left a comment

LGTM modulo question about the default value of the parameter.

/// Sets the `statement_timeout` value to use during the snapshotting phase of
/// PG sources.
const PG_SNAPSHOT_STATEMENT_TIMEOUT: ServerVar<Duration> = ServerVar {
name: UncasedStr::new("pg_snapshot_statement_timeout"),
Member

Preference to spell out more clearly that this is about PostgreSQL sources and not other PG-like things (e.g., pgwire, the SQL interface generally, whatever).

Suggested change
name: UncasedStr::new("pg_snapshot_statement_timeout"),
name: UncasedStr::new("postgres_source_snapshot_statement_timeout"),

Member

Ah, crap, just saw that we introduced pg_replication_* variables already. Oh well. Scratch my suggestion. Consistency with what we have is better.

Contributor (Author)

@benesch The pg_replication_* variables will eventually go away in favor of being applied more generally to all PG client connections. Let me see if I can bundle that work into this and we can make the names less ambiguous overall.

&format!("SET statement_timeout = {}", config.params.pg_snapshot_statement_timeout.as_millis())
).await?;
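For context, here is a minimal sketch (plain Rust, with a hypothetical helper name, not the PR's actual code) of how a `Duration`-typed parameter can be rendered into the `SET statement_timeout` command shown above; Postgres interprets a bare integer value for this setting as a number of milliseconds:

```rust
use std::time::Duration;

// Hypothetical helper (illustrative only): render the command that the
// snapshot connection executes. Postgres reads a bare integer value for
// `statement_timeout` as milliseconds.
fn statement_timeout_sql(timeout: Duration) -> String {
    format!("SET statement_timeout = {}", timeout.as_millis())
}

fn main() {
    // A 30-second timeout becomes 30000 ms.
    assert_eq!(
        statement_timeout_sql(Duration::from_secs(30)),
        "SET statement_timeout = 30000"
    );
}
```
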

mz_ore::soft_assert!{{
Member
Good use of a soft_assert!!
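For readers unfamiliar with the macro: a "soft" assertion checks an invariant without risking a production panic. A rough stand-in (illustrative only, not the actual `mz_ore` implementation) might look like:

```rust
// Illustrative stand-in for a soft assertion: panic in debug builds (and
// therefore in CI), but only log the violation in release builds so that a
// broken invariant doesn't take down production.
macro_rules! soft_assert {
    ($cond:expr) => {
        if !$cond {
            if cfg!(debug_assertions) {
                panic!("soft assertion failed: {}", stringify!($cond));
            } else {
                eprintln!("soft assertion failed: {}", stringify!($cond));
            }
        }
    };
}

fn main() {
    soft_assert!(2 + 2 == 4); // holds: no panic, no log output
}
```
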

value: &mz_postgres_util::DEFAULT_SNAPSHOT_STATEMENT_TIMEOUT,
description: "Sets the `statement_timeout` value to use during the snapshotting phase of PG sources (Materialize)",
internal: true,
};
Member
Will you do me a favor and make sure this works properly for durations of 0? I think it gets mapped to usize::MAX which might not work? But also you wrote a test and that didn't choke. So IDK.

Slack context about the weirdness of duration configuration parameters: https://materializeinc.slack.com/archives/C02FWJ94HME/p1691674999794559
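On the question of zero: in Postgres itself, `statement_timeout = 0` means "no timeout", so a zero `Duration` rendered as milliseconds is well-defined; the risk raised above is an intermediate config layer encoding "unset" as a huge sentinel. A quick illustrative check (assumptions, not the PR's code):

```rust
use std::time::Duration;

fn main() {
    // Duration::ZERO renders as 0 ms, which Postgres interprets as
    // "disable the timeout" rather than "time out immediately".
    assert_eq!(Duration::ZERO.as_millis(), 0);

    // A sentinel like usize::MAX, if it leaked through as milliseconds,
    // would exceed the i32 range Postgres accepts for this setting.
    assert!(usize::MAX as u128 > i32::MAX as u128);
}
```
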

@benesch (Member) commented Aug 14, 2023

Thanks for doing this, @sploiselle! Will be glad to not have to debug this issue a third time!

@sploiselle sploiselle force-pushed the pg-statement-timeout branch 2 times, most recently from 4b17bb3 to 9a2dc76 Compare August 16, 2023 15:13
@sploiselle sploiselle marked this pull request as ready for review August 16, 2023 20:15
@sploiselle sploiselle requested a review from a team August 16, 2023 20:15
@sploiselle sploiselle requested a review from a team as a code owner August 16, 2023 20:15
@sploiselle sploiselle requested a review from chaas August 16, 2023 20:15
@shepherdlybot (bot) commented Aug 16, 2023

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag.

Risk Score: 🔴 80 / 100
Bug Probability: 59%
Buggy File Hotspots: 3

File                    | Percentile
../render/sources.rs    | 95
../session/vars.rs      | 96
../sequencer/inner.rs   | 100

@sploiselle (Contributor, Author) commented
@MaterializeInc/adapter PTAL + cc @benesch if you're interested

Expanded the scope of this PR to include:

  • Using our preferred TCP defaults for all connections using postgres_util::Config

  • Changing postgres_util::Config::replication_timeout to postgres_util::Config::tcp_timeout and applying the timeout to all connections, not just replication connections

  • Renaming the LaunchDarkly flags to express that these parameters affect the PG source clients, not just the replication connections.

    I had originally planned to make these parameters configure all PG connections, but e.g. ConnectionContext (used for connection validation) makes that challenging, so I abandoned the effort.

  • Making the LaunchDarkly flags optional, now that we apply our defaults within the lib itself and don't need these values to override the tokio_postgres defaults.
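The "defaults live in the library, flags only override" shape described in the last bullet can be sketched like this (names are illustrative, not the actual `postgres_util::Config` API):

```rust
use std::time::Duration;

// Illustrative default, standing in for constants such as
// mz_postgres_util::DEFAULT_CONNECT_TIMEOUT.
const DEFAULT_TCP_TIMEOUT: Duration = Duration::from_secs(30);

// An optional field where `None` means "use the library default",
// rather than falling through to the tokio_postgres default.
struct Config {
    tcp_timeout: Option<Duration>,
}

impl Config {
    fn effective_tcp_timeout(&self) -> Duration {
        self.tcp_timeout.unwrap_or(DEFAULT_TCP_TIMEOUT)
    }
}

fn main() {
    let unset = Config { tcp_timeout: None };
    assert_eq!(unset.effective_tcp_timeout(), Duration::from_secs(30));

    let overridden = Config { tcp_timeout: Some(Duration::from_secs(5)) };
    assert_eq!(overridden.effective_tcp_timeout(), Duration::from_secs(5));
}
```
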

Nightly failures are not caused by this PR and have something to do with statement logging.

@benesch (Member) commented Aug 17, 2023

Thanks for slogging! Going to tap out of the re-review due to time constraints but love the movement towards unifying these settings as much as possible.

/// when connecting to PG via `mz_postgres_util`.
const PG_SOURCE_KEEPALIVES_RETRIES: ServerVar<u32> = ServerVar {
name: UncasedStr::new("pg_source_keepalives_retries"),
value: &mz_postgres_util::DEFAULT_REPLICATION_KEEPALIVE_RETRIES,
Contributor

These will break any LD settings we have. Do those need to be audited/duplicated to these new values before releasing?

Contributor (Author)

These flags were inadvertently never added to LaunchDarkly, so they cannot have been set previously. I will add the new ones to LaunchDarkly, though.

name: UncasedStr::new("pg_source_connect_timeout"),
value: &mz_postgres_util::DEFAULT_CONNECT_TIMEOUT,
value: &None,
Contributor

When a user runs SHOW ALL, what would appear with this change? Will they see the default values that will be set if unconfigured here?

Contributor (Author)

Oh, good catch here: made these non-optional again, so they will show their defaults.

Sean Loiselle added 2 commits September 19, 2023 13:20
Previously, the configurable TCP timeouts only applied when
establishing a PG replication client. With this change, we apply
the default TCP timeouts to all PG connections established through
postgres-util, while still making the values configurable.
@sploiselle sploiselle merged commit dfdd277 into MaterializeInc:main Sep 19, 2023
@sploiselle sploiselle deleted the pg-statement-timeout branch September 19, 2023 18:24