Commit 82361ea

Addressing feedback
Signed-off-by: ytimocin <ytimocin@microsoft.com>
1 parent 6e9fae3 commit 82361ea

1 file changed: +46 -73 lines changed


architecture/2025-03-upgrade-feature-spec.md

Lines changed: 46 additions & 73 deletions
@@ -6,7 +6,7 @@
66

77
The Radius in-place upgrade feature aims to streamline the process of upgrading the control plane components of Radius, ensuring minimal downtime and disruption. **Importantly, user applications deployed through Radius continue running without interruption during the upgrade process, as Radius only maintains metadata about these applications and does not manage their runtime execution.**
88

9-
**Radius Components (as of Mar 2025):**
9+
**Radius Components (as of June 2025):**
1010

1111
- Universal Control Plane
1212
- Radius Deployment Engine
@@ -17,34 +17,38 @@ The Radius in-place upgrade feature aims to streamline the process of upgrading
1717

1818
**Dependencies:**
1919

20-
- Dapr
2120
- Contour
22-
- Postgres
21+
- etcd (current datastore)
2322

2423
**Key Features:**
2524

26-
- **Compatibility:** Ensures compatibility with new versions.
27-
- **Data Safety:** Takes snapshots before the upgrade to safeguard user data, with automatic rollback in case of failure.
25+
- **Compatibility:** Ensures compatibility with new versions through robust preflight checks.
26+
- **Safety:** Utilizes Helm's built-in rollback capability for recovery in case of failure (with full data backup/restore planned for future versions).
2827
- **Seamless Experience:** Provides a smooth and reliable upgrade process with minimal downtime.
2928
- **Application Continuity:** User applications remain operational throughout the upgrade process, as Radius only manages deployment metadata, not runtime execution.
29+
- **Distributed Locking:** Prevents concurrent data modifications during upgrades to maintain system integrity.
3030

3131
This feature will significantly improve the user experience by automating the upgrade process, reducing the risk of errors, and maintaining system stability.
3232

3333
### Top level goals
3434

35-
- Deliver a seamless and intuitive upgrade experience for users.
35+
- Deliver a seamless and intuitive upgrade experience for users through a single `rad upgrade kubernetes` command.
3636
- Ensure a smooth and reliable upgrade process for the components and dependencies of Radius.
3737
- Minimize downtime and disruption during upgrades, ensuring that the Radius Control Plane is available as soon as possible.
38-
- Ensure user data safety during the upgrade process by taking snapshots and providing rollback capabilities.
38+
- Implement robust preflight checks to prevent upgrades in unsuitable conditions.
39+
- Implement Helm-based rollback capability to recover from failed upgrades.
3940
- Provide clear documentation and guidance for users performing upgrades.
40-
- Maintain system performance during the upgrade process to avoid significant impact on running applications.
41+
- Implement a data-store-level lock mechanism to prevent concurrent modifications during upgrades.
4142

4243
### Non-goals (out of scope)
4344

44-
- Downgrades are not supported. The upgrade process is designed to move forward to newer versions only.
45-
- The upgrade process does not include features for managing upgrades across multiple clusters simultaneously as Radius doesn't support multiple clusters per installation as of March 2025.
46-
- Upgrading major versions of dependencies like Postgres, Dapr, or Contour is not covered by this process. These upgrades should be handled separately following their respective guidelines.
47-
- Upgrading the Radius control plane using Helm. We can run `helm upgrade` on the Radius Helm installation but that is not going to put all the necessary pieces together for the control plane to work. Making this work is not in the scope of this work.
45+
- Downgrades are not supported in the initial version. The upgrade process is designed to move forward to newer versions only. Please see the design document and the development plan for more details.
46+
- User data backup and restore will be implemented in a future version.
47+
- The upgrade process does not include features for managing upgrades across multiple clusters simultaneously as Radius doesn't support multiple clusters per installation as of June 2025.
48+
- Upgrading major versions of dependencies like Contour is not covered by this process. These upgrades should be handled separately following their respective guidelines.
49+
- Upgrading the Radius control plane using Helm directly. We can run `helm upgrade` on the Radius Helm installation, but that alone does not put together all the pieces the control plane needs to work. Supporting that path is out of scope for this effort.
50+
- Zero-downtime control plane upgrades. While we aim to minimize disruption, guaranteeing absolutely no downtime for control plane components is not a goal for this initial release.
51+
- Automatic CLI upgrades. Users must manually update their local CLI version after upgrading the control plane.
4852

4953
## User profile and challenges
5054

@@ -57,10 +61,10 @@ The primary users of this feature are system administrators and DevOps engineers
5761

5862
### Challenge(s) faced by the user
5963

60-
- As of March 2025, Radius does not support in-place upgrades or upgrades via the Radius CLI. Users can achieve this using Helm, but this method does not update the local Radius CLI version to the desired version. The current steps are:
64+
- As of June 2025, Radius does not support in-place upgrades or upgrades via the Radius CLI. Users can achieve this using Helm, but this method does not update the local Radius CLI version to the desired version. The current steps are:
6165
- `helm install radius oci://ghcr.io/radius-project/helm-chart/radius --version 0.43.0`
6266
- `helm upgrade radius oci://ghcr.io/radius-project/helm-chart/radius --version 0.44.0`
63-
- This approach is cumbersome and does not provide a good user experience. Additionally, it does not automatically install dependencies like Contour and Dapr, requiring users to handle these installations manually.
67+
- This approach is cumbersome and does not provide a good user experience. Additionally, it does not automatically install dependencies like Contour, requiring users to handle these installations manually.
6468
- Another way to upgrade Radius to the desired version involves the following steps:
6569
- Uninstall the existing Radius installation by running `rad uninstall kubernetes`. This command doesn't delete any user data or the `radius-system` namespace.
6670
- Download and install the desired version locally. Ensure it is the correct version by running `rad version`.
@@ -85,43 +89,25 @@ After the implementation of this feature, the Radius CLI will provide a streamli
8589
# Basic upgrade to a specific version
8690
rad upgrade kubernetes --version v0.44.0
8791

88-
# Upgrade to the latest stable version
89-
rad upgrade kubernetes --version latest
90-
9192
# Upgrade with custom configuration values
9293
rad upgrade kubernetes --version v0.44.0 --set global.monitoring.enabled=true
9394

94-
# Perform a dry-run to simulate the upgrade without making changes
95-
rad upgrade kubernetes --version v0.44.0 --dry-run
96-
9795
# Upgrade with extended timeout for large deployments
9896
rad upgrade kubernetes --version v0.44.0 --timeout 600
9997
```
10098

10199
The upgrade process will:
102100

103-
1. **Pre-flight checks**: Detect existing installation, check versions, ensure the user is not attempting a downgrade, etc.
104-
2. **Fetch available chart versions**: Provide a list of known chart versions so the desired version that the users select is a valid one.
105-
3. **Dry-run** (when requested): Simulate the upgrade, logging steps without making changes. Also making sure that the upgrade will work. Helm has this feature available in the `helm upgrade` command: <https://helm.sh/docs/helm/helm_upgrade/>.
106-
4. **Snapshot**: Automatically back up current data (e.g., etcd, resources in the API server, or Postgres) before making changes.
107-
5. **Upgrade**: Apply necessary Helm changes (including timeouts, set args, etc.), optionally perform database migrations if needed.
108-
6. **Rollback** (on failure): If something goes wrong, use the snapshot to restore the prior state.
109-
7. **Post-upgrade checks**: Validate that new control plane components are healthy and confirm the upgrade was successful.
110-
111-
### Scenario 2: Upgrading the control plane using Helm (Non-Goal)
112-
113-
As mentioned above, users can install Radius using Helm, but dependencies like Contour and Dapr are not installed this way. Users can also run `helm upgrade` on a Helm installation to upgrade the version of Radius in the cluster. However, this is not an ideal solution and is considered a workaround because users will need to perform additional steps to achieve a healthy installation, similar to what they would have after running `rad install kubernetes` and `rad init`. The steps are as follows:
114-
115-
- `helm install radius oci://ghcr.io/radius-project/helm-chart/radius --version 0.43.0`
116-
- `helm upgrade radius oci://ghcr.io/radius-project/helm-chart/radius --version 0.44.0`
117-
118-
Users will still need to install Dapr and Contour and run `rad init` to achieve a similar behavior to what `rad upgrade kubernetes` would provide. I also think that the Radius CLI version needs to be updated to the desired version before running `rad init`.
101+
1. **Pre-flight checks**: Detect existing installation, check versions, verify version compatibility, validate cluster health, and ensure sufficient resources.
102+
2. **Distributed locking**: Acquire an upgrade lock to prevent concurrent data modifications.
103+
3. **Upgrade**: Apply necessary Helm changes (including timeouts, set args, etc.) in a rolling fashion to minimize downtime.
104+
4. **Health verification**: Validate that new control plane components are healthy and confirm the upgrade was successful.
105+
5. **Helm-based rollback (on failure)**: If something goes wrong, use Helm's built-in rollback capability to revert Kubernetes resources to their previous state.
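
The five steps above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `run_upgrade` and its callable parameters are hypothetical names, not part of the Radius CLI, and the real implementation may order error handling differently. It shows the intended control flow: abort before any change if preflight fails, always release the lock, and roll back when the upgrade or the health check fails.

```python
from typing import Callable, List


def run_upgrade(
    preflight: Callable[[], None],
    acquire_lock: Callable[[], None],
    release_lock: Callable[[], None],
    upgrade: Callable[[], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> List[str]:
    """Run the five steps in order and return a log of the steps taken.

    All callables are hypothetical stand-ins for the real step implementations.
    """
    log: List[str] = []

    # Step 1: preflight raises to abort before anything is changed.
    preflight()
    log.append("preflight")

    # Step 2: take the upgrade lock so no other writer modifies data.
    acquire_lock()
    log.append("lock")
    try:
        try:
            # Step 3: apply the Helm changes.
            upgrade()
            log.append("upgrade")
            # Step 4: verify the new control plane components.
            ok = healthy()
        except Exception:
            # A failed upgrade is treated like a failed health check.
            ok = False

        if ok:
            log.append("healthy")
        else:
            # Step 5: revert Kubernetes resources to their previous state.
            rollback()
            log.append("rollback")
    finally:
        # The lock is released whether the upgrade succeeded or not.
        release_lock()
        log.append("unlock")
    return log
```

Passing the steps in as callables keeps the flow testable without a cluster; each stand-in can be replaced by the real preflight, locking, and Helm logic.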
119106

120107
## Key dependencies and risks
121108

122109
- **Dependency Name**: Helm – The upgrade process relies on Helm for managing the upgrade process. Issues/concerns/risks: Compatibility with new versions of Helm.
123110
- **Dependency Name**: Contour – The upgrade process relies on Contour for routing requests. Issues/concerns/risks: Compatibility with new versions of Contour.
124-
- **Dependency Name**: Dapr – The upgrade process relies on Dapr. Issues/concerns/risks: Compatibility with new versions of Dapr.
125111

126112
## Key assumptions to test and questions to answer
127113

@@ -130,25 +116,9 @@ Users will still need to install Dapr and Contour and run `rad init` to achieve
130116
- **Version Compatibility:** If you are upgrading from a version of Radius that does not include the `rad upgrade kubernetes` command, you must first update to a release that provides the upgrade functionality. Older versions do not support automated upgrades in the CLI.
131117
- **User Permissions:** Users have the necessary permissions and access to perform upgrades, including cluster-admin roles where required.
132118
- **Data Integrity:** The upgrade process will maintain data integrity and prevent corruption during migration operations (particularly for database migrations) by using snapshots and the ability to restore in case of failure.
133-
- **Dependency Management:** We use a specific version of Contour and Dapr (<https://github.com/radius-project/radius/blob/main/pkg/cli/helm/cluster.go#L34>) and they can not be updated with the configuration. This ensures consistent dependency management. There is a case where the default chart versions may be updated in a newer version of Radius but that means that those versions of Contour and Dapr are probably tested with that new version of Radius and shouldn't provide problems during the upgrade from the current version and the desired version which is the newer version that we just discussed about.
119+
- **Dependency Management:** We pin a specific version of Contour (<https://github.com/radius-project/radius/blob/main/pkg/cli/helm/cluster.go#L34>), and it cannot be changed through configuration. This ensures consistent dependency management. A newer version of Radius may update the default chart version, but in that case the new Contour version has presumably been tested with that Radius release and should not cause problems when upgrading from the current version to the desired one.
134120
- **Resource Requirements:** The upgrade process won't exceed available cluster resources during the transition period when both old and new components may be running simultaneously.
135121

136-
### Technical Questions to Resolve
137-
138-
- **Downgrade Support:**
139-
- Should we support downgrading to previous versions? If yes, what are the limitations?
140-
- How should we handle cases where users attempt to downgrade to versions that don't support the upgrade feature itself?
141-
- **Version Skipping:**
142-
- Can users skip multiple versions in a single upgrade (e.g., v0.40 → v0.44), or should we enforce incremental upgrades?
143-
- **Failure Recovery:**
144-
- What recovery mechanisms should be implemented if an upgrade fails mid-process?
145-
- How do we ensure that partial upgrades don't leave the system in an inconsistent state?
146-
- **User Experience:**
147-
- What are the common issues users face during the upgrade process, and how can we address them?
148-
- How should progress and status updates be communicated during long-running upgrades?
149-
- **Component Versions:**
150-
- Does `radius upgrade kubernetes` mean that all the components (Dapr, Contour, Postgres, etc.) will be upgraded? Will those versions be provided by Radius release or the user?
151-
152122
### Success Metrics & Validation
153123

154124
- **Upgrade Success Rate:** Define metrics to track successful vs. failed upgrades
@@ -189,38 +159,41 @@ After this scenario is implemented, I can upgrade the Radius control plane compo
189159

190160
## Key investments
191161

192-
### Feature 1
162+
### Feature 1: Preflight Check System
163+
164+
Implement comprehensive pre-upgrade checks to ensure compatibility and verify prerequisites:
193165

194-
Implement pre-upgrade and post-upgrade checks to ensure compatibility and verify the success of the upgrade. These checks are:
166+
1. **Version Compatibility:** Ensure the current version is behind the desired version to prevent downgrades.
167+
2. **Cluster Resource Check:** Verify the cluster has sufficient resources for the rolling upgrade.
168+
3. **Control Plane Health Check:** Confirm current installation is in a healthy state before proceeding.
169+
4. **Custom Configuration Validation:** Validate any custom parameters provided with the upgrade command.
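
As an illustration of the version compatibility check in item 1, here is a minimal sketch. The function names are hypothetical, and it assumes plain `MAJOR.MINOR.PATCH` versions with an optional leading `v`; pre-release suffixes such as `-rc1` are not handled. The actual preflight check may use a full semantic-version library instead.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn 'v0.44.0' or '0.44.0' into (0, 44, 0)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def check_version_compatibility(current: str, desired: str) -> None:
    """Raise ValueError unless the desired version is newer than the current one."""
    if parse_version(desired) <= parse_version(current):
        raise ValueError(
            f"cannot upgrade from {current} to {desired}: "
            "the desired version must be newer than the installed version"
        )


# Example: an upgrade from 0.43.0 to v0.44.0 passes silently.
check_version_compatibility("0.43.0", "v0.44.0")
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating `0.9.0` as newer than `0.10.0`.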
195170

196-
1. **Pre-Upgrade Checks:**
197-
1. **Version Compatibility:** Ensure the current version is behind the desired version to prevent downgrades.
198-
2. **Snapshot Creation:** Automatically create a snapshot of the current data (e.g., etcd, resources in the API server, or Postgres) to ensure user data safety in case of rollback.
199-
2. **Post-Upgrade Checks:**
200-
1. **Component Health:** Verify that all upgraded control plane components are up and running.
171+
### Feature 2: Distributed Locking Mechanism
201172

202-
### Feature 2
173+
Implement a robust distributed locking system to prevent concurrent modifications during upgrades:
203174

204-
Provide clear documentation and guidance for users performing upgrades using the Radius CLI.
175+
1. **Data-store Level Locks**: Utilize etcd or PostgreSQL's native locking capabilities.
176+
2. **Heartbeat Mechanism**: Prevent stale locks through periodic renewal.
177+
3. **Force Override Option**: Allow administrators to release stale locks when necessary.
178+
4. **CLI Command Integration**: Update all data-modifying commands to check for active upgrade locks.
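
The locking behaviour described in items 2 to 4 can be sketched with an in-memory stand-in. This is an assumption-laden illustration, not the planned design: `UpgradeLock` and its methods are hypothetical, and a real implementation would keep the lock in the data store (for example via etcd leases or PostgreSQL advisory locks) so that every client sees it. The sketch shows a lock that expires unless renewed, a heartbeat that renews it, a force override, and the check that data-modifying commands would perform.

```python
import time
from typing import Callable, Optional


class UpgradeLock:
    """Single-process sketch of an upgrade lock with expiry (hypothetical)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._owner: Optional[str] = None
        self._expires_at = 0.0

    def _is_held(self) -> bool:
        # A lock whose owner stopped renewing it counts as stale, not held.
        return self._owner is not None and self._clock() < self._expires_at

    def acquire(self, owner: str) -> bool:
        """Take the lock unless another owner holds a live one."""
        if self._is_held() and self._owner != owner:
            return False
        self._owner = owner
        self._expires_at = self._clock() + self._ttl
        return True

    def heartbeat(self, owner: str) -> bool:
        """Renew the lock; fails if it expired or belongs to someone else."""
        if not self._is_held() or self._owner != owner:
            return False
        self._expires_at = self._clock() + self._ttl
        return True

    def release(self, owner: str, force: bool = False) -> bool:
        """Release the lock; force lets an administrator clear any lock."""
        if force or self._owner == owner:
            self._owner = None
            return True
        return False

    def blocks_writes(self) -> bool:
        """What a data-modifying CLI command would check before proceeding."""
        return self._is_held()
```

The clock is injectable so that expiry and renewal can be exercised without waiting for real time to pass.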
205179

206-
### Dry-Run Upgrade Process
180+
### Feature 3: Helm-based Upgrade and Rollback System
207181

208-
Implementing a dry-run option for the Radius control plane upgrades will allow users to simulate the upgrade process without making any actual changes. This helps in identifying potential issues and ensuring a smooth upgrade when executed for real. The dry-run process will include the following steps:
182+
Implement a reliable upgrade system using Helm with built-in rollback capability:
209183

210-
1. **Initiate Dry-Run**: User initiates the dry-run process using the command `rad upgrade --version 0.44 --dry-run`.
211-
2. **Simulate Upgrade**: The system simulates the upgrade process, performing all the steps without making any actual changes. Helm actually has a flag that we can use for the `helm upgrade` command.
212-
3. **Generate Report**: The system generates a report detailing the steps that would be taken during the actual upgrade, including any potential issues or conflicts.
213-
4. **Review Report**: User reviews the report to identify and address any potential issues before proceeding with the actual upgrade.
184+
1. **Helm Chart Management**: Enhanced wrapper around Helm's upgrade capabilities.
185+
2. **Component Health Verification**: System to verify all components are healthy after upgrade.
186+
3. **Automated Rollback**: Use Helm's built-in rollback capability if health checks fail.
187+
4. **Custom Configuration Support**: Apply user-provided configuration values during upgrade.
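
To make items 1, 3, and 4 concrete, the sketch below builds the argument lists for the equivalent `helm upgrade` and `helm rollback` invocations. It is an assumption that the wrapper maps onto the Helm CLI this way; Radius may call the Helm Go SDK directly instead. The helper names are hypothetical, and the timeout in seconds is converted to a Helm duration string such as `600s`.

```python
from typing import Dict, List


def helm_upgrade_args(release: str, chart: str, version: str,
                      values: Dict[str, str], timeout_seconds: int) -> List[str]:
    """Build a `helm upgrade` argument list (hypothetical helper)."""
    args = [
        "helm", "upgrade", release, chart,
        "--version", version,
        "--wait",                              # block until resources are ready
        "--timeout", f"{timeout_seconds}s",    # Helm expects a duration string
    ]
    # User-provided configuration values become repeated --set flags.
    for key, value in sorted(values.items()):
        args += ["--set", f"{key}={value}"]
    return args


def helm_rollback_args(release: str) -> List[str]:
    """Build a `helm rollback` argument list that reverts to the previous revision."""
    return ["helm", "rollback", release, "--wait"]


upgrade_cmd = helm_upgrade_args(
    "radius",
    "oci://ghcr.io/radius-project/helm-chart/radius",
    "0.44.0",
    {"global.monitoring.enabled": "true"},
    600,
)
```

The lists can be passed to `subprocess.run` as-is; building them separately from execution keeps the mapping from `rad upgrade kubernetes` options to Helm flags easy to review.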
214188

215189
### Plan for `rad upgrade kubernetes`
216190

217191
The `rad upgrade kubernetes` command will be designed to facilitate the upgrade process for the Radius control plane components. Here is the plan for implementing the `rad upgrade kubernetes` command:
218192

219193
1. **Command Structure**: The `rad upgrade kubernetes` command will follow a similar structure to the `rad install kubernetes` command, with additional options for performing upgrades.
220194
2. **Pre-Upgrade Checks**: The command will perform pre-upgrade checks to ensure compatibility with the existing configuration and identify any potential issues.
221-
3. **Dry-Run Option**: The command will include a `--dry-run` option to simulate the upgrade process without making any actual changes. This will help users identify potential issues before performing the actual upgrade.
222-
4. **Upgrade Execution**: The command will execute the upgrade process, including updating the control plane components, applying any necessary database migrations, and updating configurations.
223-
5. **Post-Upgrade Checks**: The command will perform post-upgrade checks to verify the success of the upgrade and ensure that the system is functioning as expected.
224-
6. **Rollback Option**: The command will include a rollback option to revert to the previous version in case of any issues during the upgrade process.
195+
3. **Upgrade Execution**: The command will execute the upgrade process, including updating the control plane components, applying any necessary database migrations, and updating configurations.
196+
4. **Post-Upgrade Checks**: The command will perform post-upgrade checks to verify the success of the upgrade and ensure that the system is functioning as expected.
197+
5. **Rollback Option**: The command will include a rollback option to revert to the previous version in case of any issues during the upgrade process.
225198

226199
By following this plan, the `rad upgrade kubernetes` command will provide a reliable and user-friendly way to upgrade the Radius control plane components, ensuring compatibility with new versions and minimizing downtime and disruption.
