Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.1: sql/catalog/tabledesc: permit zero-valued column IDs in DependedOnBy #82859

Merged
merged 1 commit into from
Jun 21, 2022

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Jun 13, 2022

Backport 1/1 commits from #82833 on behalf of @ajwerner.

/cc @cockroachdb/release


In 21.1 there was a bug whereby we would store a 0-value column ID in the
sequence's DependedOnBy because we'd add the depedency before allocating
an ID to the column. This bug was fixed in 21.2. Below is a reproduction
I've used to play with this bug:

roachprod wipe local
roachprod stage local release v21.1.3
roachprod run local -- mv cockroach cockroach-v21.1.3
roachprod stage local release v21.2.10
roachprod run local -- mv cockroach cockroach-v21.2.10
roachprod stage local release v22.1.1
roachprod run local -- mv cockroach cockroach-v22.1.1
roachprod start local --binary cockroach-v21.1.3
roachprod sql local -- -e "create table t1( i int primary key);
                           create table t2(i int primary key);
                           create sequence s1;
                           create sequence s2;
                           ALTER TABLE t1 ADD c1 BIGINT DEFAULT nextval('s1') NOT NULL;
                           ALTER TABLE t1 ADD c2 BIGINT DEFAULT nextval('s2') NOT NULL;
                           ALTER TABLE t2 ADD c1 BIGINT DEFAULT nextval('s1') NOT NULL;"
roachprod stop local
roachprod start local --binary cockroach-v21.2.10
while ! { roachprod sql local -- -e 'show cluster setting version' | grep 21.2 ; }; do sleep 1; done

roachprod sql local -- -e "alter table t1 add column c3 int default nextval('s1');
                           create table t3 (i int primary key);
                           alter table t3 add column c1 int default nextval('s1');"

roachprod stop local
roachprod start local --binary cockroach-v22.1.1
while ! { roachprod sql local -- -e 'show cluster setting version' | grep 22.1 ; }; do sleep 1; done

For now, for the rest of 22.1 we'll let this slide. As follow-up work, we'll
perform a migration to repair this situation and add back this validation once
the migration has been performed.

Relates to #82576.

Release note (bug fix): In earlier 22.1 releases of cockroach, added validation
could cause problems for descriptors which carried invalid backreferences due
to an earlier bug in 21.1. This stricter validation could result in a variety
of query failures. This patch weakens the validation to permit the corruption
as we know it. A subsequent patch in 22.2 will be created to repair the invalid
reference.


Release justification: This bug can cause incompatibilities when upgrading from an earlier version to 22.1.

In 21.1 there was a bug whereby we would store a 0-value column ID in the
sequence's DependedOnBy because we'd add the depedency before allocating
an ID to the column. This bug was fixed in 21.2. Below is a reproduction
I've used to play with this bug:

```bash
roachprod wipe local
roachprod stage local release v21.1.3
roachprod run local -- mv cockroach cockroach-v21.1.3
roachprod stage local release v21.2.10
roachprod run local -- mv cockroach cockroach-v21.2.10
roachprod stage local release v22.1.1
roachprod run local -- mv cockroach cockroach-v22.1.1
roachprod start local --binary cockroach-v21.1.3
roachprod sql local -- -e "create table t1( i int primary key);
                           create table t2(i int primary key);
                           create sequence s1;
                           create sequence s2;
                           ALTER TABLE t1 ADD c1 BIGINT DEFAULT nextval('s1') NOT NULL;
                           ALTER TABLE t1 ADD c2 BIGINT DEFAULT nextval('s2') NOT NULL;
                           ALTER TABLE t2 ADD c1 BIGINT DEFAULT nextval('s1') NOT NULL;"
roachprod stop local
roachprod start local --binary cockroach-v21.2.10
while ! { roachprod sql local -- -e 'show cluster setting version' | grep 21.2 ; }; do sleep 1; done

roachprod sql local -- -e "alter table t1 add column c3 int default nextval('s1');
                           create table t3 (i int primary key);
                           alter table t3 add column c1 int default nextval('s1');"

roachprod stop local
roachprod start local --binary cockroach-v22.1.1
while ! { roachprod sql local -- -e 'show cluster setting version' | grep 22.1 ; }; do sleep 1; done
```

For now, for the rest of 22.1 we'll let this slide. As follow-up work, we'll
perform a migration to repair this situation and add back this validation once
the migration has been performed.

Release note (bug fix): In earlier 22.1 releases of cockroach, added validation
could cause problems for descriptors which carried invalid backreferences due
to an earlier bug in 21.1. This stricter validation could result in a variety
of query failures. This patch weakens the validation to permit the corruption
as we know it. A subsequent patch in 22.2 will be created to repair the invalid
reference.
@blathers-crl blathers-crl bot requested a review from a team June 13, 2022 22:10
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-22.1-82833 branch from b96f686 to 5f38e71 Compare June 13, 2022 22:10
@blathers-crl
Copy link
Author

blathers-crl bot commented Jun 13, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Jun 13, 2022
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner merged commit 55b3087 into release-22.1 Jun 21, 2022
@ajwerner ajwerner deleted the blathers/backport-release-22.1-82833 branch June 21, 2022 14:28
craig bot pushed a commit that referenced this pull request Aug 26, 2022
85990: metric: add prometheus-based histogram r=aadityasondhi a=aadityasondhi

From #81181:

Our current histogram is based on `hdrhistogram`. This tends to create
lots of buckets and is inflexible w.r.t the bucket layout. In hindsight,
it was a poor choice of default implementation (I can say that since I
introduced it) and its cost is disincentivizing the introduction of
histograms that would be useful.

This commit introduces a histogram that is based on a completely vanilla
`prometheus.Histogram`. The only reason we need to wrap it is because we
want to export quantiles to CockraochDB's internal timeseries (it does
not support histograms) and this requires maintaining an internal windowed
histogram (on top of the cumulative histogram).

With this done, we can now introduce metrics with any kind of buckets we
want. Helpfully, we introduce two common kinds of buckets, suitable for
IO-type and RPC-type latencies. These are defined in a human-readable
format by explicitly listing out the buckets.

We can move existing metrics to HistogramV2 easily, assuming we are not
concerned with existing prometheus scrapers getting tripped up by the
changes in bucket boundaries. I assume this is not a big deal in
practice as long as it doesn't happen "all the time". In fact, I would
strongly suggest we move all metrics wholesale and remove the
hdrhistogram-based implementation. If this is not acceptable for some
reason, we ought to at least deprecated it.

We also slightly improve the existing `Histogram` code by unifying how
the windowed histograms are ticked and by making explicit where their
quantiles are recorded (this dependency was previously hidden via a
local interface assertion).

Resolves #10015.
Resolves #64962.
Alternative to dhartunian@eac3d06

Release justification: low risk, high benefit changes

86007: kvserver: log traces from replicate queue on errors or slow processing r=andreimatei a=AlexTalks

While we previously had some logging from the replicate queue as a
result of the standard queue logging, this change adds logging to the
replicate queue when there are errors in processing a replica, or when
processing a replica exceeds a 50% of the timeout duration.
When there are errors or the duration threshold is exceeded,
any error messages are logged along with the collected tracing spans
from the operation.

Release note (ops change): Added logging on replicate queue processing
in the presence of errors or when the duration exceeds 50% of the
timeout.

Release justification: Low risk observability change.

86255: upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references r=Xiang-Gu a=Xiang-Gu

commit 1: a small refactoring of existing function
commit 2: added a field to TenantDeps
commit 3: added a cluster version upgrade that attempts to update
invalid column ID in seq's back references due to bugs in prior versions.
See below for details
commit 4: wrote a test for this newly added upgrade

Previously, in version 21.1 and prior, `ADD COLUMN DEFAULT nextval('s')`
will incorrectly store a 0-valued column id in the sequence 's' back reference
`dependedOnBy` because we added this dependency before allocating an
ID to the newly added column. Customers ran into issues when upgrading
to v22.1 so we relaxed the validation logic as a quick short-term fix, as detailed
in #82859. 

Now we want to give this issue a more proper treatment by adding a cluster
upgrade (to v22.2) that attempts to detect such invalid column ID issues and 
update them with the correct column ID. This PR does exactly this.

Fixes: #83124

Release note: None

Release justification: fixed a release blocker that will resolve invalid column ID
appearance in sequence's back referenced, caused by bugs in older binaries.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Aaditya Sondhi <aadityas@cockroachlabs.com>
Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants