
storage: consider relaxing need for blocking cluster versions when performing Pebble migrations #96819

Open
nicktrav opened this issue Feb 8, 2023 · 8 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@nicktrav
Collaborator

nicktrav commented Feb 8, 2023

Is your feature request related to a problem? Please describe.

In instances where we need to ensure that all nodes in a cluster have Pebble stores satisfying some precondition (e.g. all "old" SSTables have been re-written; tables with split user keys have been compacted), we typically use two cluster versions. The first is used as a "mark" phase in one Cockroach version V. A second is then introduced in a subsequent release V+n (where n is usually 1) that blocks waiting for all the marked items to be processed.

This blocking operation in the version V+n is likely overly conservative. Instead of holding up the progression through subsequent cluster versions, the precondition could be incorporated into a background job. The completion of said job would then become a precondition for the finalization of the Cockroach version V+n.
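To make the pattern concrete, here is a minimal sketch of the two-phase shape described above. All identifiers are illustrative placeholders, not the real CockroachDB/Pebble symbols: version V only marks work, while version V+n blocks finalization until the marked work has drained.

```go
// Sketch of the existing two-cluster-version pattern. All identifiers
// here are illustrative placeholders, not the real CockroachDB symbols.
package migrations

import (
	"context"
	"time"
)

// Store stands in for a single Pebble store on a node.
type Store interface {
	// MarkObsoleteTables flags old tables for rewriting (version V).
	MarkObsoleteTables(ctx context.Context) error
	// MarkedTableCount reports how many flagged tables remain.
	MarkedTableCount() int
}

// runMarkPhase is attached to cluster version V: it only marks work.
func runMarkPhase(ctx context.Context, stores []Store) error {
	for _, s := range stores {
		if err := s.MarkObsoleteTables(ctx); err != nil {
			return err
		}
	}
	return nil
}

// runBlockingPhase is attached to cluster version V+n: it holds up
// version finalization until every marked table has been rewritten,
// which is the step this issue proposes moving into a background job.
func runBlockingPhase(ctx context.Context, stores []Store) error {
	for _, s := range stores {
		for s.MarkedTableCount() > 0 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(10 * time.Second):
			}
		}
	}
	return nil
}
```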

Describe the solution you'd like

Investigate whether we can use the job framework and preconditions to accomplish the same goal, rather than employing a second cluster version.
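One possible shape for this, again with hypothetical names (the real jobs framework has considerably more lifecycle machinery): the migration kicks off a background job, and the job's completion is checked as a finalization precondition rather than blocked on inline.

```go
// Sketch of the job-based alternative; the job registry and precondition
// hook here are hypothetical stand-ins for the real jobs framework.
package migrations

import "context"

// Job stands in for a background job tracked by the jobs framework.
type Job interface {
	Done() bool
}

// JobRegistry stands in for the piece of the framework that creates and
// looks up jobs.
type JobRegistry interface {
	StartRewriteJob(ctx context.Context) (Job, error)
	LookupRewriteJob(ctx context.Context) (Job, bool)
}

// preconditionForVPlusN gates finalization of cluster version V+n on the
// rewrite job having finished, instead of blocking inside the migration.
func preconditionForVPlusN(ctx context.Context, reg JobRegistry) (bool, error) {
	job, ok := reg.LookupRewriteJob(ctx)
	if !ok {
		// No job yet: start one and report "not ready".
		if _, err := reg.StartRewriteJob(ctx); err != nil {
			return false, err
		}
		return false, nil
	}
	return job.Done(), nil
}
```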

Describe alternatives you've considered

Keep things the way they are (i.e. two cluster versions), at the expense of having a potentially long-running Pebble background operation preventing other Cockroach migrations from occurring.

Additional context

More context in this internal thread.

Jira issue: CRDB-24351

@nicktrav nicktrav added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Feb 8, 2023
@nicktrav
Collaborator Author

I managed to find the docs here on long-running migrations. The corresponding RFC is here.

We roll the Pebble FMV bump into an existing Cockroach cluster version that is used to enable value blocks (code). To avoid confusion, and to avoid re-arranging the existing Pebble FMVs (which could break our new guarantees around alpha version compatibility), we can no-op and rename the existing Pebble FMV. We can then re-introduce it as a new, higher-numbered Pebble FMV that corresponds directly to a new Cockroach internal cluster version coming after the value blocks migration. This new internal cluster version is the one that will be used in the migration job.

The re-introduction of the Pebble FMV should be safe, as the underlying rewrite compactions are idempotent. If all the tables were previously rewritten, running the Pebble migration job again should be a quick operation (i.e. we only pay the cost of scanning the table versions, and no rewrite compactions will be scheduled, as the rewrites have already occurred).
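Concretely, the rearrangement could look roughly like the sketch below. The constant names follow Pebble's format_major_version.go, but the values and neighbors here are approximate and should be double-checked against the actual file.

```go
// Sketch of the proposed FormatMajorVersion rearrangement; names follow
// Pebble's format_major_version.go, but values and ordering here are
// approximate and should be checked against the actual file.
package pebble

type FormatMajorVersion uint64

const (
	// ... versions 0 through 10 elided ...

	// FormatUnusedPrePebblev1MarkedCompacted (formerly
	// FormatPrePebblev1MarkedCompacted) becomes a no-op: ratcheting to
	// it no longer waits for rewrite compactions.
	FormatUnusedPrePebblev1MarkedCompacted FormatMajorVersion = iota + 11

	// FormatSSTableValueBlocks is the 23.1 version that enables value
	// blocks, the one the Cockroach cluster version actually gates.
	FormatSSTableValueBlocks

	// FormatPrePebblev1MarkedCompacted is re-introduced at the tail, in
	// a 23.2 section, paired with a new Cockroach internal cluster
	// version whose migration performs the (idempotent) blocking wait.
	FormatPrePebblev1MarkedCompacted
)
```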

The alternative here is to not ship this blocking migration in 23.1. To achieve this, we'd make the same no-op change suggested above. The downside is that this prevents us from bumping the min SSTable version to at least Pebblev1, which would block removing code that handles table props and split user keys.

@nicktrav
Collaborator Author

I was looking at outdated code. I think what we want is now in pkg/upgrade. Specifically, I think we want a system migration.
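For illustration, a system migration for this could take roughly the following shape; the dependency types below are placeholders rather than the actual pkg/upgrade signatures, and the ratcheting method mirrors Pebble's RatchetFormatMajorVersion.

```go
// Rough shape of a pkg/upgrade system migration for this; the dependency
// types below are placeholders, not the real pkg/upgrade signatures.
package upgrades

import "context"

// PebbleStore stands in for one of the node's stores.
type PebbleStore interface {
	// RatchetFormatMajorVersion mirrors Pebble's method of the same
	// name; for "MarkedCompacted" versions it waits for the rewrites.
	RatchetFormatMajorVersion(fmv uint64) error
}

// SystemDeps mimics the dependencies handed to a system migration.
type SystemDeps struct {
	Stores []PebbleStore
}

// pebbleFormatMigration ratchets every store to the target Pebble format
// major version; run as a system migration, it would be the blocking
// step tied to the new internal cluster version.
func pebbleFormatMigration(ctx context.Context, deps SystemDeps, target uint64) error {
	for _, s := range deps.Stores {
		if err := s.RatchetFormatMajorVersion(target); err != nil {
			return err
		}
	}
	return nil
}
```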

@irfansharif
Contributor

irfansharif commented Feb 15, 2023

FWIW, the alternative you’ve listed above:

Keep things the way they are (i.e. two cluster versions), at the expense of having a potentially long-running Pebble background operation preventing other Cockroach migrations from occurring.

Is not so bad. This is in fact what the framework promotes: sequentially run long-running migrations. We've run migrations that take on the order of minutes before (#58088), and it's fine. The latency of version finalization is relatively unimportant given the timescales upgrades typically take (many minutes, even just to roll into new binaries). It doesn't block foreground traffic or anything. Is there a more pressing need to introduce this asynchrony (= kick off a new job that runs the real migration, a job that might fail and whose retries/lifecycle you'd have to manage, plus whose effects you'd have to check for in the later version)?

@nicktrav
Collaborator Author

We’ve run migrations that take in the order of minutes before (#58088), it’s fine.

We're talking about timescales that are potentially on the order of hours, depending on the size of the cluster and how much "cold" data there is in the lower levels of the LSM. The migration itself needs to rewrite all SSTables that have a sufficiently old version. Here's an example of a CC cluster where it took ~35 minutes for old tables to be re-written (internal metrics). And that was on a cluster with no other compaction activity (the re-write ran at the lowest priority).

@jbowens
Collaborator

jbowens commented Feb 27, 2023

@irfansharif what I'm hoping for most is not to avoid blocking the version finalization, but to block 23.1 finalization from beginning until this migration is complete. Currently, we'll unnecessarily commit to 23.1 before we need to. It'd be better for the user to preserve backwards compatibility for as long as possible, committing only when we're actually introducing a change incompatible with 22.2.

nicktrav added a commit to nicktrav/pebble that referenced this issue Feb 28, 2023
Currently, the `FormatPrePebblev1MarkedCompacted` (enum value `11`) is
used as a synchronization point for the DB, ensuring that all
pre-Pebblev1 tables have been rewritten. This format major version is
implicitly tied to a Cockroach internal cluster version that ships in
the 23.1 release. Not only is an implicit Pebble FMV upgrade confusing,
it is also unnecessarily blocking.

Rather than blocking the finalization of 23.1, the decision was made to
instead alter Cockroach to ensure that all tables had been rewritten
before ratcheting the Pebble FMV.

Rename the existing FMV for the blocking re-write. The version was
renamed to include `Unused`, along with a comment detailing the
reasoning.

Move the existing FMV, `FormatPrePebblev1MarkedCompacted`, to be the
latest in the sequence of FMVs, in a section dedicated to 23.2 FMVs.

Touches cockroachdb/cockroach#96819.
nicktrav added a commit to nicktrav/pebble that referenced this issue Mar 1, 2023
nicktrav added a commit to cockroachdb/pebble that referenced this issue Mar 1, 2023
@nicktrav
Collaborator Author

nicktrav commented Mar 1, 2023

Paraphrasing a conversation I had with @jbowens:

Another option to consider is having some kind of internal cluster version that runs first and acts as a fence, gating one or more preconditions that must be satisfied before finalization can begin.

This is something we can consider for 23.2. It's probably too late in the scheme of things for 23.1.

I'm curious @jbowens @irfansharif - would that be any different from how we'd do it now (i.e. a cluster version linked to a Pebble FMV update)? The only difference is that the fence version is a catch-all for other preconditions.

Or would the idea be to not tie this to a Pebble FMV, and instead use something like a Pebble metric to determine whether the precondition is satisfied? And then we'd have the usual cluster version / FMV ratchet in addition to this?

@jbowens
Collaborator

jbowens commented Mar 1, 2023

Or would the idea be to not tie this to a Pebble FMV, and instead use something like a Pebble metric to determine whether the precondition is satisfied? And then we'd have the usual cluster version / FMV ratchet in addition to this?

I think this is right.
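If we take the metric route, the precondition check could be as simple as polling each store's marked-file count. A sketch, assuming Pebble's Metrics().Compact.MarkedFiles gauge tracks files marked for rewrite (the surrounding store plumbing is a placeholder):

```go
// Sketch of a metric-based precondition: finalization proceeds only once
// no store still has files marked for rewrite. Uses Pebble's
// Metrics().Compact.MarkedFiles; the surrounding plumbing is a
// placeholder.
package upgrades

import "github.com/cockroachdb/pebble"

// pebbleRewritesComplete reports whether every store has finished its
// marked rewrite compactions.
func pebbleRewritesComplete(dbs []*pebble.DB) bool {
	for _, db := range dbs {
		if db.Metrics().Compact.MarkedFiles > 0 {
			return false
		}
	}
	return true
}
```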

@nicktrav
Collaborator Author

nicktrav commented Mar 2, 2023

I made #97936 as a proposal to investigate the blocking / predicate-based approach.

Projects
Status: 24.2 candidates