Error for stuck ALTER DATABASE causes job to hang #131342

Closed
kevinkokomani opened this issue Sep 25, 2024 · 5 comments · Fixed by #135168
Assignees
Labels
branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@kevinkokomani
Contributor

kevinkokomani commented Sep 25, 2024

Describe the problem

A customer attempted to run an ALTER DATABASE to add a newly added region to their database's zone configurations:

ALTER DATABASE db_name ADD REGION "region_name";

The statement ran for two hours, seemingly stuck at 0% according to the job's progress on the DB Console -> Jobs page. Running the statement again yielded the following error:

ERROR: "region "region_name" already added to database

However, running SHOW REGIONS FROM DATABASE db_name; contradicted that error: "region_name" does not appear in its output:

  database |    region     | primary | secondary |                        zones
-----------+---------------+---------+-----------+-----------------------------------------------------
  db_name  | other_region1 |    t    |     f     | {other_region_1a, other_region_1b, other_region_1c}
  db_name  | other_region2 |    f    |     f     | {other_region_2a, other_region_2b, other_region_2c}
  db_name  | other_region3 |    f    |     f     | {other_region_3a, other_region_3b, other_region_3c}

Only running show job <job_id>; with the job ID shown on the DB Console -> Jobs page revealed the actual cause of the error:

{"non-cancelable: could not validate zone config: RangeMaxBytes 20971520 less than minimum allowed 67108864"}

This is prone to happen for database or table objects that were created when the default RangeMaxBytes and RangeMinBytes were much lower and that haven't been altered since. We don't appear to have any automation in the upgrade process that moves these objects to new defaults (rightfully so, as we likely don't want to silently change values during a routine upgrade without approval from the operator).
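
For reference, the two numbers in the error are 20 MiB (20971520) and 64 MiB (67108864). The following is a minimal Go sketch of the kind of lower-bound check that would produce this message. It is an illustration only, not CockroachDB's implementation: the names minRangeMaxBytes and validateRangeMaxBytes are hypothetical, and the only values taken from this issue are the two byte counts and the message text.

```go
package main

import "fmt"

// minRangeMaxBytes is the lower bound quoted in the error message (64 MiB).
// The name is hypothetical; only the value comes from the issue.
const minRangeMaxBytes int64 = 64 << 20 // 67108864

// validateRangeMaxBytes mimics the shape of the check that rejected the
// legacy zone config. It is an illustration, not CockroachDB's code.
func validateRangeMaxBytes(rangeMaxBytes int64) error {
	if rangeMaxBytes < minRangeMaxBytes {
		return fmt.Errorf("RangeMaxBytes %d less than minimum allowed %d",
			rangeMaxBytes, minRangeMaxBytes)
	}
	return nil
}

func main() {
	legacy := int64(20 << 20) // 20971520, the value reported by the stuck job
	fmt.Println(validateRangeMaxBytes(legacy))

	// A value above the minimum passes the check.
	fmt.Println(validateRangeMaxBytes(512 << 20))
}
```

An object created under the old defaults keeps its stored value, so a check like this only fails later, when some operation re-validates the zone config.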

Based on the above, there are two main problems:

  1. The Jobs page did not clearly show this error. Whether the error should also have a recommendation as to what to do next, I'm not sure, as it could be a number of actions. This will be handled in a separate issue.
  2. The ALTER DATABASE ADD REGION job probably shouldn't have been stuck in this state at all.

To Reproduce

Should be reproducible with the following:

  1. Create a cluster with an old range size default, or otherwise somehow force the range size to be lower than default
  2. Create a database: create database test_db;
  3. Add a primary region, which should succeed: alter database test_db primary region "us-east-1";
  4. Upgrade to a modern version with updated defaults
  5. Attempt to add a secondary region: alter database test_db add region "us-west-1";
  6. You should get the same error

Expected behavior

Given the main problems:

  1. I would expect the Jobs page to show any error the job has hit, which it doesn't appear to do in this case. There may be room for a recommendation on what to do next, even if it only says to reach out to support. This will be handled in a separate GitHub issue.
  2. Ideally, the ALTER DATABASE ADD REGION or any similar job would roll back instead of hanging in this state where show regions from database does not match the current state of the database.

Environment:

Any version in which there was an upgraded range size default and a cluster has been upgraded to that version

Additional context

Not knowing where the error is or how to fix it can block critical production deployments.

Jira issue: CRDB-42507

Epic CRDB-43310

@kevinkokomani kevinkokomani added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 25, 2024

blathers-crl bot commented Sep 25, 2024

Hi @kevinkokomani, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kevinkokomani kevinkokomani added 24.2 O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Sep 25, 2024
@rafiss
Collaborator

rafiss commented Oct 1, 2024

@kevinkokomani Is there a link to an escalation that provides more context? I am curious what the state of the job was when you ran show job <job_id>;. Maybe it was stuck retrying (which would also be a bug).

The error not being surfaced in the DB Console would be an o11y issue. I'm going to rename this issue so it's focused only on the hanging job. Please file a separate issue for the o11y team to investigate the DB Console problem.

@rafiss rafiss changed the title Error for stuck ALTER DATABASE not surfaced in DB Console; job hangs Error for stuck ALTER DATABASE causes job to hang Oct 1, 2024
@rafiss rafiss added branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 and removed 24.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 labels Oct 1, 2024
@exalate-issue-sync exalate-issue-sync bot added the branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 label Oct 1, 2024
@exalate-issue-sync exalate-issue-sync bot added P-2 Issues/test failures with a fix SLA of 3 months and removed branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 labels Oct 1, 2024
@rafiss
Collaborator

rafiss commented Oct 28, 2024

For the issue where the "could not validate zone config" error caused the job to fail, let's check if marking the error as non-retryable would resolve this.

@kevinkokomani
Contributor Author

@rafiss you can see the full context here: https://cockroachdb.zendesk.com/agent/tickets/23648.

fqazi added a commit to fqazi/cockroach that referenced this issue Nov 14, 2024
Previously, schema changes for databases involving zone configuration
modifications could hang. This occurred because the system didn't
validate the existing zone configuration's validity before initiating
these operations. Since users can manually modify zone configurations,
they might inadvertently introduce invalid states. To address this, this
patch validates the zone configuration before the following database
operations: setting the primary region, setting the secondary region,
adding a region, and dropping a region.

Fixes: cockroachdb#131342

Release note (bug fix): Prevent ALTER DATABASE operations that modify
the zone config from hanging if an invalid zone config already exists.
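
The ordering described in this commit message, validating the existing zone config before the operation changes anything, can be sketched in Go as follows. This is a toy model under stated assumptions, not the code from the patch: the database type, its fields, validateZoneConfig, and addRegion are all hypothetical, and the 64 MiB bound is taken from the error quoted earlier in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// database is a toy stand-in for a multi-region database (hypothetical).
type database struct {
	rangeMaxBytes int64
	regions       []string
}

// minRangeMaxBytes is the 64 MiB bound from the error in this issue.
const minRangeMaxBytes int64 = 64 << 20

// validateZoneConfig rejects a zone config whose range size is too small.
func (d *database) validateZoneConfig() error {
	if d.rangeMaxBytes < minRangeMaxBytes {
		return errors.New("could not validate zone config")
	}
	return nil
}

// addRegion validates the existing zone config first, so an invalid config
// is rejected before any region state is written.
func (d *database) addRegion(region string) error {
	if err := d.validateZoneConfig(); err != nil {
		return fmt.Errorf("cannot add region %q: %w", region, err)
	}
	d.regions = append(d.regions, region)
	return nil
}

func main() {
	db := &database{rangeMaxBytes: 20 << 20, regions: []string{"us-east-1"}}

	// The statement fails up front and the region list is left unchanged.
	err := db.addRegion("us-west-1")
	fmt.Println(err, db.regions)
}
```

Failing before any state is written avoids the situation reported above, where the error claimed the region was already added while SHOW REGIONS did not list it.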
craig bot pushed a commit that referenced this issue Nov 14, 2024
135168: sql: validate zone config before multi region database DDL r=fqazi a=fqazi

Previously, schema changes for databases involving zone configuration modifications could hang. This occurred because the system didn't validate the existing zone configuration's validity before initiating these operations. Since users can manually modify zone configurations, they might inadvertently introduce invalid states. To address this, this patch validates the zone configuration before the following database operations: setting the primary region, setting the secondary region, adding a region, and dropping a region.

Fixes: #131342

Release note (bug fix): Prevent ALTER DATABASE operations that modify the zone config from hanging if an invalid zone config already exists.

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
@craig craig bot closed this as completed in 1a32cbe Nov 14, 2024

blathers-crl bot commented Nov 14, 2024

Based on the specified backports for linked PR #135168, I applied the following new label(s) to this issue: branch-release-24.3. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 label Nov 14, 2024
blathers-crl bot pushed a commit that referenced this issue Nov 14, 2024
Previously, schema changes for databases involving zone configuration
modifications could hang. This occurred because the system didn't
validate the existing zone configuration's validity before initiating
these operations. Since users can manually modify zone configurations,
they might inadvertently introduce invalid states. To address this, this
patch validates the zone configuration before the following database
operations: setting the primary region, setting the secondary region,
adding a region, and dropping a region.

Fixes: #131342

Release note (bug fix): Prevent ALTER DATABASE operations that modify
the zone config from hanging if an invalid zone config already exists.
3 participants