Error for stuck ALTER DATABASE causes job to hang #131342
Hi @kevinkokomani, please add branch-* labels to identify which branch(es) this C-bug affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
@kevinkokomani Is there a link to an escalation that provides more context? I am curious what the state of the job was when you ran
For the issue about the error not being surfaced in the DB Console, that would be an o11y issue. I'm going to rename this issue so it's just focused on the hanging job issue. Please file a separate issue for the o11y team to investigate the DB Console problem.
For the issue where the |
@rafiss you can see the full context here: https://cockroachdb.zendesk.com/agent/tickets/23648.
Previously, schema changes for databases involving zone configuration modifications could hang. This occurred because the system didn't validate the existing zone configuration's validity before initiating these operations. Since users can manually modify zone configurations, they might inadvertently introduce invalid states. To address this, this patch validates the zone configuration before the following database operations: setting the primary region, setting the secondary region, adding a region, and dropping a region. Fixes: cockroachdb#131342 Release note (bug fix): Prevent ALTER DATABASE operations that modify the zone config from hanging if an invalid zone config already exists.
135168: sql: validate zone config before multi region database DDL r=fqazi a=fqazi Previously, schema changes for databases involving zone configuration modifications could hang. This occurred because the system didn't validate the existing zone configuration's validity before initiating these operations. Since users can manually modify zone configurations, they might inadvertently introduce invalid states. To address this, this patch validates the zone configuration before the following database operations: setting the primary region, setting the secondary region, adding a region, and dropping a region. Fixes: #131342 Release note (bug fix): Prevent ALTER DATABASE operations that modify the zone config from hanging if an invalid zone config already exists. Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
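The fix described above amounts to a pre-flight check: validate the existing zone configuration, and return an error right away if it is invalid, before any of the four region operations begins. The following is a minimal Go sketch of that idea only, not the code from the patch. The `zoneConfig` type, the `regionOp` constants, and the `validate` and `runRegionOp` functions are all hypothetical names, and the single rule in `validate` is an assumed placeholder for whatever checks the real validation performs.

```go
package main

import (
	"errors"
	"fmt"
)

// zoneConfig is a hypothetical, simplified stand-in for a database zone
// configuration; the real CockroachDB type has many more fields.
type zoneConfig struct {
	RangeMinBytes int64
	RangeMaxBytes int64
}

// regionOp names the four operations listed in the patch description.
type regionOp string

const (
	opSetPrimaryRegion   regionOp = "SET PRIMARY REGION"
	opSetSecondaryRegion regionOp = "SET SECONDARY REGION"
	opAddRegion          regionOp = "ADD REGION"
	opDropRegion         regionOp = "DROP REGION"
)

// validate is a placeholder check (an assumption for this sketch): it only
// rejects a config whose minimum range size is not below its maximum.
func validate(cfg zoneConfig) error {
	if cfg.RangeMinBytes >= cfg.RangeMaxBytes {
		return errors.New("range_min_bytes must be less than range_max_bytes")
	}
	return nil
}

// runRegionOp validates the existing config first and returns an error
// immediately, rather than starting work that could never complete.
func runRegionOp(op regionOp, existing zoneConfig, apply func() error) error {
	if err := validate(existing); err != nil {
		return fmt.Errorf("cannot %s: existing zone config is invalid: %w", op, err)
	}
	return apply()
}

func main() {
	// A manually edited config that ended up in an invalid state.
	bad := zoneConfig{RangeMinBytes: 512, RangeMaxBytes: 256}
	applied := false
	err := runRegionOp(opAddRegion, bad, func() error {
		applied = true
		return nil
	})
	fmt.Println("error:", err)
	fmt.Println("operation applied:", applied)
}
```

In this sketch the caller gets the validation error synchronously, so there is no background job left sitting at 0% for an operator to discover later.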
Based on the specified backports for linked PR #135168, I applied the following new label(s) to this issue: branch-release-24.3. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Describe the problem
A customer attempted to run an ALTER DATABASE to add a newly added region to their database's zone configurations:
ALTER DATABASE db_name ADD REGION "region_name";
The behavior experienced was that this ran for two hours, seemingly stuck at 0% when checking the job's progress via the DB Console -> Jobs page. Running the statement again would yield the following error:
ERROR: region "region_name" already added to database
However, running SHOW REGIONS FROM DATABASE db_name; disagreed with the error output above: "region_name" did not appear in the output.
It was only when running show job <job_id>; for the job ID shown on the DB Console -> Jobs page that the actual cause of the error was revealed.
This is prone to happen for database or table objects that were created long ago, when the default RangeMaxBytes and RangeMinBytes were much lower, and that haven't been altered since. It doesn't appear that we have any automation during the upgrade process that would change the defaults of these objects if there are new defaults (rightfully so, as we likely don't want to silently change values during a routine upgrade without approval from the operator).
There are two main "problems" as it seems based on the above:
The Jobs page did not clearly show this error. Whether the error should also have a recommendation as to what to do next, I'm not sure, as it could be a number of actions. This will be handled in a separate issue.
The ALTER DATABASE ADD REGION job probably shouldn't have been stuck in this state at all.
To Reproduce
Should be reproducible with the following:
create database test_db;
alter database test_db primary region "us-east-1";
alter database test_db add region "us-west-1";
Expected behavior
Given the main problems:
I would expect the jobs page to show any error the job has hit, which it doesn't look like it's doing in this case. Perhaps there is room for a recommendation on what to do next - perhaps it should just say to reach out to support. This will be done in a separate GitHub issue.
show regions from database does not match the current state of the database.
Environment:
Any version in which there was an upgraded range size default and a cluster has been upgraded to that version
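To illustrate the environment condition, here is a hedged Go sketch of a report-only audit that flags objects whose configured range sizes predate a newer floor. In line with the concern about silently changing values during an upgrade, it only lists the affected objects and never rewrites anything. The `objectZone` type, the `staleObjects` function, the object names, and the 64 MiB floor are all illustrative assumptions, not values or APIs taken from CockroachDB.

```go
package main

import "fmt"

// objectZone is a hypothetical record pairing an object name with its
// configured range sizes in bytes.
type objectZone struct {
	Name          string
	RangeMinBytes int64
	RangeMaxBytes int64
}

// staleObjects returns the names of objects whose RangeMaxBytes is below
// minAllowedMax. It only reports; it never rewrites a value.
func staleObjects(objs []objectZone, minAllowedMax int64) []string {
	var stale []string
	for _, o := range objs {
		if o.RangeMaxBytes < minAllowedMax {
			stale = append(stale, o.Name)
		}
	}
	return stale
}

func main() {
	const mib = int64(1) << 20
	objs := []objectZone{
		// Created long ago with small range sizes and never altered since.
		{Name: "old_db", RangeMinBytes: 1 * mib, RangeMaxBytes: 8 * mib},
		// Created more recently with larger range sizes.
		{Name: "new_db", RangeMinBytes: 128 * mib, RangeMaxBytes: 512 * mib},
	}
	// 64 MiB is an illustrative floor, not a confirmed CockroachDB constant.
	fmt.Println(staleObjects(objs, 64*mib))
}
```

An operator could review such a list before attempting a region change and decide explicitly which objects to alter.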
Additional context
Not knowing where the error is or how to fix it can block critical production deployments.
Jira issue: CRDB-42507
Epic CRDB-43310