
add recovery guide for two dc deployment #9804

Draft · Oreoxmt wants to merge 4 commits into master from translate/docs-cn-8669

Conversation

@Oreoxmt (Collaborator) commented Aug 3, 2022

First-time contributors' checklist

What is changed, added or deleted? (Required)

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions.

  • master (the latest development version)
  • v6.2 (TiDB 6.2 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)
  • v5.3 (TiDB 5.3 versions)
  • v5.2 (TiDB 5.2 versions)
  • v5.1 (TiDB 5.1 versions)
  • v5.0 (TiDB 5.0 versions)

What is the related PR or file link(s)?

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after being applied to another branch
  • Might cause conflicts after being applied to another branch

@Oreoxmt Oreoxmt added the translation/from-docs-cn, do-not-merge/hold, area/engine, and v6.2 labels Aug 3, 2022
@Oreoxmt Oreoxmt self-assigned this Aug 3, 2022
@ti-chi-bot (Member)

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added the do-not-merge/work-in-progress label Aug 3, 2022
@ti-chi-bot ti-chi-bot requested a review from shichun-0415 August 3, 2022 10:31
@ti-chi-bot ti-chi-bot added the size/M label Aug 3, 2022
tikv-control.md Outdated (resolved)
@Oreoxmt Oreoxmt force-pushed the translate/docs-cn-8669 branch from d69dde0 to 190a308 on August 4, 2022 08:46
@Oreoxmt Oreoxmt changed the title from "WIP: add recovery guide for two dc deployment" to "add recovery guide for two dc deployment" Aug 4, 2022
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress label Aug 4, 2022
@Oreoxmt Oreoxmt requested review from nolouch, TomShawn and disksing and removed request for shichun-0415 August 4, 2022 08:48
@Oreoxmt Oreoxmt added the status/PTAL label and removed the do-not-merge/hold label Aug 4, 2022
@Oreoxmt (Collaborator, Author) commented Aug 4, 2022

/verify

@Oreoxmt (Collaborator, Author) commented Aug 5, 2022

@nolouch @disksing PTAL

tikv-control.md Outdated
@@ -463,6 +463,26 @@ success!
> - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. Specifying the PD endpoints is to query whether the specified `region_id` is validated or not.
> - You need to run this command for all stores where specified Regions' peers are located.

### Recover from ACID inconsistency data
Contributor:

Suggested change
### Recover from ACID inconsistency data
### Recover ACID-inconsistent data

tikv-control.md Outdated
@@ -463,6 +463,26 @@ success!
> - The argument of the `-p` option specifies the PD endpoints without the `http` prefix. Specifying the PD endpoints is to query whether the specified `region_id` is validated or not.
> - You need to run this command for all stores where specified Regions' peers are located.

### Recover from ACID inconsistency data

To recover data from ACID inconsistency, such as the loss of most replicas or incomplete data synchronization, you can use the `reset-to-version` command. When using this command, you need to provide an old version number that can promise the ACID consistency. Then, `tikv-ctl` cleans up all data after the specified version.
Contributor:

Suggested change
To recover data from ACID inconsistency, such as the loss of most replicas or incomplete data synchronization, you can use the `reset-to-version` command. When using this command, you need to provide an old version number that can promise the ACID consistency. Then, `tikv-ctl` cleans up all data after the specified version.
To recover data that breaks ACID consistency, such as the loss of most replicas or incomplete data replication, you can use the `reset-to-version` command. When using this command, you need to provide an old version number that can promise the ACID consistency. Then, `tikv-ctl` cleans up all data after the specified version.

tikv-control.md Outdated

To recover data from ACID inconsistency, such as the loss of most replicas or incomplete data synchronization, you can use the `reset-to-version` command. When using this command, you need to provide an old version number that can promise the ACID consistency. Then, `tikv-ctl` cleans up all data after the specified version.

- The `-v` option is used to specify the version number to restore. To get the value of the `-v` parameter, you can use the `pd-ctl min-resolved-ts` command.
Contributor:

Suggested change
- The `-v` option is used to specify the version number to restore. To get the value of the `-v` parameter, you can use the `pd-ctl min-resolved-ts` command.
- The `-v` option is used to specify the version number to recover. To get the value of the `-v` option, you can use the `pd-ctl min-resolved-ts` command in PD Control.
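To make the two commands discussed in this thread concrete, here is a minimal sketch of how they fit together. The addresses, ports, and timestamp value are placeholders, and whether you call the binaries directly or through `tiup ctl` depends on your deployment.

```shell
# 1. Ask PD for the latest version (TSO) that is known to be ACID-consistent.
#    127.0.0.1:2379 is a placeholder for one of your PD endpoints.
pd-ctl -u http://127.0.0.1:2379 min-resolved-ts

# 2. Stop all processes that write to TiKV (for example, TiDB), then run
#    reset-to-version against every TiKV node with the version obtained above.
#    127.0.0.1:20160 and 430315739761082369 are placeholder values.
tikv-ctl --host 127.0.0.1:20160 reset-to-version -v 430315739761082369
# Expected output on success: success!
```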

tikv-control.md Outdated

> **Note:**
>
> - The preceding command only supports the online mode. Before executing the command, you need to stop processes that will write data to TiKV, such as TiDB. After the command is executed successfully, it will return `success!`.
Contributor:

Suggested change
> - The preceding command only supports the online mode. Before executing the command, you need to stop processes that will write data to TiKV, such as TiDB. After the command is executed successfully, it will return `success!`.
> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that will write data to TiKV, such as TiDB. After the command is run successfully, `success!` is returned in the output.

tikv-control.md Outdated
> **Note:**
>
> - The preceding command only supports the online mode. Before executing the command, you need to stop processes that will write data to TiKV, such as TiDB. After the command is executed successfully, it will return `success!`.
> - You need to execute the same command for all TiKV nodes in the cluster.
Contributor:

Suggested change
> - You need to execute the same command for all TiKV nodes in the cluster.
> - You need to run the same command for all TiKV nodes in the cluster.

- If the status before failure is switching from the asynchronous to synchronous (the status code is `sync-recover`), part of the written data in the primary data center in the asynchronous replication mode is lost after using the secondary data center to recover. This might cause the ACID inconsistency, and you need to recover it additionally. A typical scenario is that the primary data center disconnects from the secondary data center, the connection is restored after switching to the asynchronous mode, and data is written. But during the data synchronization between primary and secondary, something goes wrong and causes the overall failure of the primary data center.
Contributor:

The meaning is inaccurate.

1. Stop all PD, TiKV, and TiDB services of the secondary data center.

2. Start PD nodes of the secondary data center using the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.
Contributor:

Suggested change
2. Start PD nodes of the secondary data center using the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.
2. Start PD nodes of the secondary data center in the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.

2. Start PD nodes of the secondary data center using the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.

3. Use the [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data in the secondary data center and the parameters are the list of all Store IDs in the primary data center.
Contributor:

Suggested change
3. Use the [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data in the secondary data center and the parameters are the list of all Store IDs in the primary data center.
3. Follow the instructions in [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data of the secondary data center. The parameters are the list of all Store IDs in the primary data center.

3. Use the [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data in the secondary data center and the parameters are the list of all Store IDs in the primary data center.

4. Write a new configuration of placement rule using [PD Control](/pd-control.md), and the Voter replica configuration of the Region is the same as the original cluster in the secondary data center.
Contributor:

Suggested change
4. Write a new configuration of placement rule using [PD Control](/pd-control.md), and the Voter replica configuration of the Region is the same as the original cluster in the secondary data center.
4. Write a new placement rule configuration file and use it in [PD Control](/pd-control.md). In the configuration file, the Voter replica count of the Region is the same as that of the original cluster in the secondary data center.

5. Start the PD and TiKV services of the primary data center.

6. To recover ACID consistency (the status of `DR_STATE` in the old PD is `sync-recover`), you can use [`reset-to-version`](/tikv-control.md#recover-from-acid-inconsistency-data) to process TiKV data and the `version` parameter used can be obtained from `pd-ctl min-resolved-ts`.
Contributor:

Suggested change
6. To recover ACID consistency (the status of `DR_STATE` in the old PD is `sync-recover`), you can use [`reset-to-version`](/tikv-control.md#recover-from-acid-inconsistency-data) to process TiKV data and the `version` parameter used can be obtained from `pd-ctl min-resolved-ts`.
6. To perform an ACID-consistent data recovery (the status of `DR_STATE` in the old PD is `sync-recover`), you can use [the `reset-to-version` command of TiKV Control](/tikv-control.md#recover-from-acid-inconsistency-data) to process TiKV data. The `version` parameter can be obtained by running `pd-ctl min-resolved-ts` in PD Control.
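For step 3 of the procedure reviewed above, a hedged sketch of what invoking Online Unsafe Recovery through PD Control might look like. The PD address and the Store IDs (1,4,5 here, standing for all stores of the failed primary data center) are placeholders.

```shell
# Ask PD to treat all stores of the failed primary data center as permanently
# lost and to repair Region metadata from the replicas that remain in the
# secondary data center. 1,4,5 are placeholder Store IDs.
pd-ctl -u http://127.0.0.1:2379 unsafe remove-failed-stores 1,4,5

# Check the progress of the unsafe recovery operation.
pd-ctl -u http://127.0.0.1:2379 unsafe remove-failed-stores show
```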

@Oreoxmt Oreoxmt requested a review from TomShawn August 8, 2022 06:12

## `--force-new-cluster`

- Force to create a new cluster using current nodes.
Contributor:

Suggested change
- Force to create a new cluster using current nodes.
- Forcibly creates a new cluster using current nodes.


- Force to create a new cluster using current nodes.
- Default: `false`
- It is recommended to use this flag only when recovering services due to PD losing most replicas, which might cause data loss.
Contributor:

Suggested change
- It is recommended to use this flag only when recovering services due to PD losing most replicas, which might cause data loss.
- It is recommended to use this flag only for recovering services when PD loses most of its replicas, which might cause data loss.
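For context on how this flag is typically passed, a minimal sketch of launching a single PD node with it. The node name, directories, and URLs are placeholder values; in a TiUP-managed cluster you would instead adjust the PD start script or run the binary manually.

```shell
# Start one surviving PD node as a brand-new single-member cluster.
# All names, paths, and URLs below are placeholders.
pd-server --force-new-cluster \
    --name=pd-dr-1 \
    --data-dir=/data/pd-dr-1 \
    --client-urls="http://0.0.0.0:2379" \
    --advertise-client-urls="http://192.168.0.11:2379" \
    --peer-urls="http://0.0.0.0:2380" \
    --advertise-peer-urls="http://192.168.0.11:2380"
```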

tikv-control.md Outdated

> **Note:**
>
> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that will write data to TiKV, such as TiDB. After the command is run successfully, `success!` is returned in the output.
Contributor:

Suggested change
> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that will write data to TiKV, such as TiDB. After the command is run successfully, `success!` is returned in the output.
> - The preceding command only supports the online mode. Before running the command, you need to stop the processes that will write data to TiKV, such as the TiDB processes. After the command is run successfully, `success!` is returned in the output.

When a disaster occurs to a cluster in the synchronous replication mode, you can perform data recovery with `RPO = 0`:
> **Tip:**
>
> If you need support for disaster recovery, you can contact the TiDB team for a recovery solution.
Contributor:

Suggested change
> If you need support for disaster recovery, you can contact the TiDB team for a recovery solution.
> If you need support for disaster recovery, contact the TiDB team for a recovery solution.


When a disaster occurs to a cluster that is not in the synchronous replication mode and you cannot perform data recovery with `RPO = 0`:
- If the cluster before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary DC to recover with `RPO = 0`.
Contributor:

Suggested change
- If the cluster before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary DC to recover with `RPO = 0`.
- If the cluster before failure is in the synchronous replication mode (the status code is `sync` or `async_wait`), you can use the secondary DC to recover data with `RPO = 0`.

- If the cluster before failure is in the asynchronous replication mode (the status code is `async`), after recovering the primary DC with the data of the secondary DC, the data written from the primary DC to the secondary DC before the failure in the asynchronous replication mode will be lost. A typical scenario is that the primary DC disconnects from the secondary DC and the primary DC switches to the asynchronous replication mode and provides service for a while before the overall failure.

- If the cluster before failure is in synchronous recovery mode (the status code is `sync-recover`). After using the secondary DC to recover the service, some data written by the primary DC in the asynchronous replication mode might be lost. This might break the ACID consistency and you need to recover the ACID-inconsistent data additionally. A typical scenario is that the primary DC disconnects from the secondary DC and the connection is recovered after some data is written to the primary DC in the asynchronous replication mode. But during the asynchronous replication between primary and secondary, something goes wrong and causes the primary DC to fail as a whole.
Contributor:

Suggested change
- If the cluster before failure is in synchronous recovery mode (the status code is `sync-recover`). After using the secondary DC to recover the service, some data written by the primary DC in the asynchronous replication mode might be lost. This might break the ACID consistency and you need to recover the ACID-inconsistent data additionally. A typical scenario is that the primary DC disconnects from the secondary DC and the connection is recovered after some data is written to the primary DC in the asynchronous replication mode. But during the asynchronous replication between primary and secondary, something goes wrong and causes the primary DC to fail as a whole.
- If the cluster before failure is switching from asynchronous to synchronous mode (the status code is `sync-recover`), after using the secondary DC to recover the service, some data asynchronously replicated from the primary DC to the secondary DC will be lost. This might break the ACID consistency, and you need to recover the ACID-inconsistent data accordingly. A typical scenario is that the primary DC disconnects from the secondary DC. After some data is replicated to the primary DC in the asynchronous mode, the connection is recovered. But during the asynchronous replication, errors occur again and cause the primary DC to fail as a whole.
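To decide which of the cases above applies, you need the replication status the cluster was in before the failure. Assuming the PD replication-mode status API documented for DR Auto-Sync is available (the path and address below are recalled from the docs and should be verified against your PD version), a quick check looks like this:

```shell
# Query the DR Auto-Sync replication state from a surviving PD node.
# 127.0.0.1:2379 is a placeholder PD address.
curl http://127.0.0.1:2379/pd/api/v1/replication_mode/status
# The "state" field in the response ("sync", "async_wait", "async", or
# "sync-recover") corresponds to the status codes discussed above.
```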

1. Stop all PD, TiKV, and TiDB services of the secondary DC.

2. Start PD nodes of the secondary DC in the single replica mode with the [`--force-new-cluster`](/command-line-flags-for-pd-configuration.md#--force-new-cluster) flag.
Contributor:

What is single replica mode? It is not mentioned anywhere.

3. Follow the instructions in [Online Unsafe Recovery](/online-unsafe-recovery.md) to process the TiKV data of the secondary DC. The parameters are the list of all Store IDs in the primary DC.

4. Write a new placement rule configuration file and use it in [PD Control](/pd-control.md). In the configuration file, the Voter replica count of the Region is the same as that of the original cluster in the secondary DC.
Contributor:

Suggested change
4. Write a new placement rule configuration file and use it in [PD Control](/pd-control.md). In the configuration file, the Voter replica count of the Region is the same as that of the original cluster in the secondary DC.
4. Write a new placement rule configuration file and use it in [PD Control](/pd-control.md). In the configuration file, the Voter replica count of the Region is the same as that of the original cluster in the secondary DC.
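For the placement-rule step above, a sketch of what the new configuration and the PD Control call might look like. The rule IDs, the `dc` label key and `dc-2` value, the count of 2, and the file name `rules.json` are illustrative assumptions; set the Voter count to match that of the original cluster in the secondary DC.

```shell
# Write a placement rule that keeps only Voter replicas in the surviving DC.
# The group/rule IDs, label key/value, and count below are placeholders.
cat > rules.json <<'EOF'
[
  {
    "group_id": "pd",
    "id": "default",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 2,
    "label_constraints": [
      {"key": "dc", "op": "in", "values": ["dc-2"]}
    ]
  }
]
EOF

# Load the rules into PD (the PD address is a placeholder).
pd-ctl -u http://127.0.0.1:2379 config placement-rules save --in=rules.json
```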

@Oreoxmt Oreoxmt added the for-future-release label and removed the status/PTAL label Aug 9, 2022
@Oreoxmt Oreoxmt removed the v6.2 label Aug 9, 2022
@Oreoxmt Oreoxmt added the do-not-merge/hold label Jan 16, 2023
@Oreoxmt Oreoxmt marked this pull request as draft January 16, 2023 07:00
@ti-chi-bot ti-chi-bot added the do-not-merge/work-in-progress label Jan 16, 2023
@ti-chi-bot ti-chi-bot added the needs-rebase label Feb 18, 2023
@ti-chi-bot (Member)

@Oreoxmt: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ti-chi-bot bot commented Jan 26, 2024

@Oreoxmt: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-verify | 8d2b228 | link | true | /test pull-verify |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
