architecture/2025-03-upgrade-feature-spec.md
Lines changed: 46 additions & 73 deletions
@@ -6,7 +6,7 @@
6
6
7
7
The Radius in-place upgrade feature aims to streamline the process of upgrading the control plane components of Radius, ensuring minimal downtime and disruption. **Importantly, user applications deployed through Radius continue running without interruption during the upgrade process, as Radius only maintains metadata about these applications and does not manage their runtime execution.**
8
8
9
-
**Radius Components (as of Mar 2025):**
9
+
**Radius Components (as of June 2025):**
10
10
11
11
- Universal Control Plane
12
12
- Radius Deployment Engine
@@ -17,34 +17,38 @@ The Radius in-place upgrade feature aims to streamline the process of upgrading
17
17
18
18
**Dependencies:**
19
19
20
-
- Dapr
21
20
- Contour
22
-
- Postgres
21
+
- etcd (current datastore)
23
22
24
23
**Key Features:**
25
24
26
-
- **Compatibility:** Ensures compatibility with new versions.
27
-
- **Data Safety:** Takes snapshots before the upgrade to safeguard user data, with automatic rollback in case of failure.
25
+
- **Compatibility:** Ensures compatibility with new versions through robust preflight checks.
26
+
- **Safety:** Utilizes Helm's built-in rollback capability for recovery in case of failure (with full data backup/restore planned for future versions).
28
27
- **Seamless Experience:** Provides a smooth and reliable upgrade process with minimal downtime.
29
28
- **Application Continuity:** User applications remain operational throughout the upgrade process, as Radius only manages deployment metadata, not runtime execution.
29
+
- **Distributed Locking:** Prevents concurrent data modifications during upgrades to maintain system integrity.
30
30
31
31
This feature will significantly improve the user experience by automating the upgrade process, reducing the risk of errors, and maintaining system stability.
32
32
33
33
### Top level goals
34
34
35
-
- Deliver a seamless and intuitive upgrade experience for users.
35
+
- Deliver a seamless and intuitive upgrade experience for users through a single `rad upgrade kubernetes` command.
36
36
- Ensure a smooth and reliable upgrade process for the components and dependencies of Radius.
37
37
- Minimize downtime and disruption during upgrades, ensuring that the Radius Control Plane is available as soon as possible.
38
-
- Ensure user data safety during the upgrade process by taking snapshots and providing rollback capabilities.
38
+
- Implement robust preflight checks to prevent upgrades in unsuitable conditions.
39
+
- Implement Helm-based rollback capability to recover from failed upgrades.
39
40
- Provide clear documentation and guidance for users performing upgrades.
40
-
- Maintain system performance during the upgrade process to avoid significant impact on running applications.
41
+
- Implement a data-store-level lock mechanism to prevent concurrent modifications during upgrades.
41
42
42
43
### Non-goals (out of scope)
43
44
44
-
- Downgrades are not supported. The upgrade process is designed to move forward to newer versions only.
45
-
- The upgrade process does not include features for managing upgrades across multiple clusters simultaneously as Radius doesn't support multiple clusters per installation as of March 2025.
46
-
- Upgrading major versions of dependencies like Postgres, Dapr, or Contour is not covered by this process. These upgrades should be handled separately following their respective guidelines.
47
-
- Upgrading the Radius control plane using Helm. We can run `helm upgrade` on the Radius Helm installation but that is not going to put all the necessary pieces together for the control plane to work. Making this work is not in the scope of this work.
45
+
- Downgrades are not supported in the initial version. The upgrade process is designed to move forward to newer versions only. Please see the design document and the development plan for more details.
46
+
- User data backup and restore will be implemented in a future version.
47
+
- The upgrade process does not include features for managing upgrades across multiple clusters simultaneously as Radius doesn't support multiple clusters per installation as of June 2025.
48
+
- Upgrading major versions of dependencies like Contour is not covered by this process. These upgrades should be handled separately following their respective guidelines.
49
+
- Upgrading the Radius control plane using Helm directly. Running `helm upgrade` on the Radius Helm installation is possible, but it does not put together all the pieces the control plane needs to work. Supporting this path is out of scope.
50
+
- Zero-downtime control plane upgrades. While we aim to minimize disruption, guaranteeing absolutely no downtime for control plane components is not a goal for this initial release.
51
+
- Automatic CLI upgrades. Users must manually update their local CLI version after upgrading the control plane.
48
52
49
53
## User profile and challenges
50
54
@@ -57,10 +61,10 @@ The primary users of this feature are system administrators and DevOps engineers
57
61
58
62
### Challenge(s) faced by the user
59
63
60
-
- As of March 2025, Radius does not support in-place upgrades or upgrades via the Radius CLI. Users can achieve this using Helm, but this method does not update the local Radius CLI version to the desired version. The current steps are:
64
+
- As of June 2025, Radius does not support in-place upgrades or upgrades via the Radius CLI. Users can achieve this using Helm, but this method does not update the local Radius CLI version to the desired version. The current steps are:
- This approach is cumbersome and does not provide a good user experience. Additionally, it does not automatically install dependencies like Contour and Dapr, requiring users to handle these installations manually.
67
+
- This approach is cumbersome and does not provide a good user experience. Additionally, it does not automatically install dependencies like Contour, requiring users to handle these installations manually.
64
68
- Another way to upgrade Radius to the desired version involves the following steps:
65
69
- Uninstall the existing Radius installation by running `rad uninstall kubernetes`. This command doesn't delete any user data or the `radius-system` namespace.
66
70
- Download and install the desired version locally. Ensure it is the correct version by running `rad version`.
@@ -85,43 +89,25 @@ After the implementation of this feature, the Radius CLI will provide a streamli
85
89
# Basic upgrade to a specific version
86
90
rad upgrade kubernetes --version v0.44.0
87
91
88
-
# Upgrade to the latest stable version
89
-
rad upgrade kubernetes --version latest
90
-
91
92
# Upgrade with custom configuration values
92
93
rad upgrade kubernetes --version v0.44.0 --set global.monitoring.enabled=true
93
94
94
-
# Perform a dry-run to simulate the upgrade without making changes
95
-
rad upgrade kubernetes --version v0.44.0 --dry-run
96
-
97
95
# Upgrade with extended timeout for large deployments
98
96
rad upgrade kubernetes --version v0.44.0 --timeout 600
99
97
```
100
98
101
99
The upgrade process will:
102
100
103
-
1. **Pre-flight checks**: Detect existing installation, check versions, ensure the user is not attempting a downgrade, etc.
104
-
2. **Fetch available chart versions**: Provide a list of known chart versions so the desired version that the users select is a valid one.
105
-
3. **Dry-run** (when requested): Simulate the upgrade, logging steps without making changes. Also making sure that the upgrade will work. Helm has this feature available in the `helm upgrade` command: <https://helm.sh/docs/helm/helm_upgrade/>.
106
-
4. **Snapshot**: Automatically back up current data (e.g., etcd, resources in the API server, or Postgres) before making changes.
107
-
5. **Upgrade**: Apply necessary Helm changes (including timeouts, set args, etc.), optionally perform database migrations if needed.
108
-
6. **Rollback** (on failure): If something goes wrong, use the snapshot to restore the prior state.
109
-
7. **Post-upgrade checks**: Validate that new control plane components are healthy and confirm the upgrade was successful.
110
-
111
-
### Scenario 2: Upgrading the control plane using Helm (Non-Goal)
112
-
113
-
As mentioned above, users can install Radius using Helm, but dependencies like Contour and Dapr are not installed this way. Users can also run `helm upgrade` on a Helm installation to upgrade the version of Radius in the cluster. However, this is not an ideal solution and is considered a workaround because users will need to perform additional steps to achieve a healthy installation, similar to what they would have after running `rad install kubernetes` and `rad init`. The steps are as follows:
Users will still need to install Dapr and Contour and run `rad init` to achieve a similar behavior to what `rad upgrade kubernetes` would provide. I also think that the Radius CLI version needs to be updated to the desired version before running `rad init`.
101
+
1. **Pre-flight checks**: Detect existing installation, check versions, verify version compatibility, validate cluster health, and ensure sufficient resources.
102
+
2. **Distributed locking**: Acquire an upgrade lock to prevent concurrent data modifications.
103
+
3. **Upgrade**: Apply necessary Helm changes (including timeouts, set args, etc.) in a rolling fashion to minimize downtime.
104
+
4. **Health verification**: Validate that new control plane components are healthy and confirm the upgrade was successful.
105
+
5. **Helm-based rollback (on failure)**: If something goes wrong, use Helm's built-in rollback capability to revert Kubernetes resources to their previous state.
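
The ordering of the five steps above can be sketched as a small orchestration function. This is an illustrative Python sketch under stated assumptions, not the Radius implementation: `run_upgrade` and every callable passed to it are hypothetical placeholders for the real preflight, locking, Helm, and health-check logic.

```python
from typing import Callable


def run_upgrade(
    preflight: Callable[[], bool],
    acquire_lock: Callable[[], bool],
    release_lock: Callable[[], None],
    apply_upgrade: Callable[[], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Run the upgrade steps in order and return a short outcome string.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    # Step 1: refuse to start if any preflight check fails.
    if not preflight():
        return "aborted: preflight failed"
    # Step 2: only one upgrade may modify data at a time.
    if not acquire_lock():
        return "aborted: another upgrade holds the lock"
    try:
        # Step 3: apply the change; a raised error triggers rollback.
        try:
            apply_upgrade()
        except Exception:
            rollback()
            return "rolled back: upgrade failed"
        # Step 4: verify health; step 5: roll back if unhealthy.
        if not is_healthy():
            rollback()
            return "rolled back: health check failed"
        return "upgraded"
    finally:
        # The lock is released on every path once it has been acquired.
        release_lock()
```

Releasing the lock in a `finally` block is one way to make sure a failed upgrade does not leave data-modifying commands blocked.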
119
106
120
107
## Key dependencies and risks
121
108
122
109
- **Dependency Name**: Helm – The upgrade process relies on Helm for managing the upgrade process. Issues/concerns/risks: Compatibility with new versions of Helm.
123
110
- **Dependency Name**: Contour – The upgrade process relies on Contour for routing requests. Issues/concerns/risks: Compatibility with new versions of Contour.
124
-
- **Dependency Name**: Dapr – The upgrade process relies on Dapr. Issues/concerns/risks: Compatibility with new versions of Dapr.
125
111
126
112
## Key assumptions to test and questions to answer
127
113
@@ -130,25 +116,9 @@ Users will still need to install Dapr and Contour and run `rad init` to achieve
130
116
- **Version Compatibility:** If you are upgrading from a version of Radius that does not include the `rad upgrade kubernetes` command, you must first update to a release that provides the upgrade functionality. Older versions do not support automated upgrades in the CLI.
131
117
- **User Permissions:** Users have the necessary permissions and access to perform upgrades, including cluster-admin roles where required.
132
118
- **Data Integrity:** The upgrade process will maintain data integrity and prevent corruption during migration operations (particularly for database migrations) by using snapshots and the ability to restore in case of failure.
133
-
-**Dependency Management:** We use a specific version of Contour and Dapr (<https://github.com/radius-project/radius/blob/main/pkg/cli/helm/cluster.go#L34>) and they can not be updated with the configuration. This ensures consistent dependency management. There is a case where the default chart versions may be updated in a newer version of Radius but that means that those versions of Contour and Dapr are probably tested with that new version of Radius and shouldn't provide problems during the upgrade from the current version and the desired version which is the newer version that we just discussed about.
119
+
- **Dependency Management:** Radius pins a specific version of Contour (<https://github.com/radius-project/radius/blob/main/pkg/cli/helm/cluster.go#L34>), and it cannot be changed through configuration. This ensures consistent dependency management. A newer Radius release may change the default chart version, but in that case the new Contour version has presumably been tested with that release, so it should not cause problems when upgrading from the current version to the desired version.
134
120
- **Resource Requirements:** The upgrade process won't exceed available cluster resources during the transition period when both old and new components may be running simultaneously.
135
121
136
-
### Technical Questions to Resolve
137
-
138
-
- **Downgrade Support:**
139
-
- Should we support downgrading to previous versions? If yes, what are the limitations?
140
-
- How should we handle cases where users attempt to downgrade to versions that don't support the upgrade feature itself?
141
-
- **Version Skipping:**
142
-
- Can users skip multiple versions in a single upgrade (e.g., v0.40 → v0.44), or should we enforce incremental upgrades?
143
-
- **Failure Recovery:**
144
-
- What recovery mechanisms should be implemented if an upgrade fails mid-process?
145
-
- How do we ensure that partial upgrades don't leave the system in an inconsistent state?
146
-
- **User Experience:**
147
-
- What are the common issues users face during the upgrade process, and how can we address them?
148
-
- How should progress and status updates be communicated during long-running upgrades?
149
-
- **Component Versions:**
150
-
- Does `radius upgrade kubernetes` mean that all the components (Dapr, Contour, Postgres, etc.) will be upgraded? Will those versions be provided by Radius release or the user?
151
-
152
122
### Success Metrics & Validation
153
123
154
124
- **Upgrade Success Rate:** Define metrics to track successful vs. failed upgrades
@@ -189,38 +159,41 @@ After this scenario is implemented, I can upgrade the Radius control plane compo
189
159
190
160
## Key investments
191
161
192
-
### Feature 1
162
+
### Feature 1: Preflight Check System
163
+
164
+
Implement comprehensive pre-upgrade checks to ensure compatibility and verify prerequisites:
193
165
194
-
Implement pre-upgrade and post-upgrade checks to ensure compatibility and verify the success of the upgrade. These checks are:
166
+
1. **Version Compatibility:** Ensure the current version is behind the desired version to prevent downgrades.
167
+
2. **Cluster Resource Check:** Verify the cluster has sufficient resources for the rolling upgrade.
168
+
3. **Control Plane Health Check:** Confirm current installation is in a healthy state before proceeding.
169
+
4. **Custom Configuration Validation:** Validate any custom parameters provided with the upgrade command.
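
As an illustration of the version compatibility check in item 1, here is a minimal Python sketch. It is not the Radius implementation: `parse_version` and `check_version_compatibility` are hypothetical names, and the sketch assumes version strings shaped like `v0.44.0` or `0.44`, as used elsewhere in this spec.

```python
import re
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse 'v0.44.0' or '0.44' into a (major, minor, patch) tuple."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)(?:\.(\d+))?", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {tag!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def check_version_compatibility(current: str, desired: str) -> None:
    """Raise ValueError unless the current version is strictly behind the desired one."""
    cur, des = parse_version(current), parse_version(desired)
    if des < cur:
        raise ValueError(f"downgrade from {current} to {desired} is not supported")
    if des == cur:
        raise ValueError(f"control plane is already at {current}")
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as treating `v0.9.0` as newer than `v0.44.0`.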
195
170
196
-
1. **Pre-Upgrade Checks:**
197
-
   1. **Version Compatibility:** Ensure the current version is behind the desired version to prevent downgrades.
198
-
   2. **Snapshot Creation:** Automatically create a snapshot of the current data (e.g., etcd, resources in the API server, or Postgres) to ensure user data safety in case of rollback.
199
-
2. **Post-Upgrade Checks:**
200
-
   1. **Component Health:** Verify that all upgraded control plane components are up and running.
171
+
### Feature 2: Distributed Locking Mechanism
201
172
202
-
### Feature 2
173
+
Implement a robust distributed locking system to prevent concurrent modifications during upgrades:
203
174
204
-
Provide clear documentation and guidance for users performing upgrades using the Radius CLI.
175
+
1. **Data-store Level Locks**: Utilize etcd or PostgreSQL's native locking capabilities.
176
+
2. **Heartbeat Mechanism**: Prevent stale locks through periodic renewal.
177
+
3. **Force Override Option**: Allow administrators to release stale locks when necessary.
178
+
4. **CLI Command Integration**: Update all data-modifying commands to check for active upgrade locks.
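
The interplay of lock acquisition, heartbeat renewal, and force override listed above can be sketched in Python. This is a single-process, in-memory stand-in, not the planned design: the `UpgradeLock` class, its method names, and the 30-second TTL are all assumptions made for illustration. A real implementation would rely on the data store's own atomic primitives instead of a Python object.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    owner: str
    expires_at: float


class UpgradeLock:
    """Hypothetical in-memory lock with a TTL that must be renewed by heartbeat."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lease: Optional[Lease] = None

    def _active(self) -> Optional[Lease]:
        # A lease that has passed its expiry counts as stale (not held).
        if self._lease is not None and self._lease.expires_at > self._clock():
            return self._lease
        return None

    def held(self) -> bool:
        """What a data-modifying command would check before proceeding."""
        return self._active() is not None

    def acquire(self, owner: str) -> bool:
        active = self._active()
        if active is not None and active.owner != owner:
            return False
        self._lease = Lease(owner, self._clock() + self._ttl)
        return True

    def heartbeat(self, owner: str) -> bool:
        # Renewal only succeeds for the current owner of a live lease.
        active = self._active()
        if active is None or active.owner != owner:
            return False
        active.expires_at = self._clock() + self._ttl
        return True

    def release(self, owner: str, force: bool = False) -> bool:
        # force=True models the administrator override.
        if self._lease is None:
            return False
        if self._lease.owner != owner and not force:
            return False
        self._lease = None
        return True
```

Because the lease expires unless renewed, an upgrade process that crashes stops blocking other commands after one TTL, and the force override covers cases where an administrator cannot wait that long.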
205
179
206
-
### Dry-Run Upgrade Process
180
+
### Feature 3: Helm-based Upgrade and Rollback System
207
181
208
-
Implementing a dry-run option for the Radius control plane upgrades will allow users to simulate the upgrade process without making any actual changes. This helps in identifying potential issues and ensuring a smooth upgrade when executed for real. The dry-run process will include the following steps:
182
+
Implement a reliable upgrade system using Helm with built-in rollback capability:
209
183
210
-
1. **Initiate Dry-Run**: User initiates the dry-run process using the command `rad upgrade --version 0.44 --dry-run`.
211
-
2. **Simulate Upgrade**: The system simulates the upgrade process, performing all the steps without making any actual changes. Helm actually has a flag that we can use for the `helm upgrade` command.
212
-
3. **Generate Report**: The system generates a report detailing the steps that would be taken during the actual upgrade, including any potential issues or conflicts.
213
-
4. **Review Report**: User reviews the report to identify and address any potential issues before proceeding with the actual upgrade.
184
+
1. **Helm Chart Management**: Enhanced wrapper around Helm's upgrade capabilities.
185
+
2. **Component Health Verification**: System to verify all components are healthy after upgrade.
186
+
3. **Automated Rollback**: Use Helm's built-in rollback capability if health checks fail.
187
+
4. **Custom Configuration Support**: Apply user-provided configuration values during upgrade.
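
Component health verification with a deadline, which would feed the automated rollback decision, might look like the following Python sketch. It is an assumption-laden illustration: `wait_until_healthy`, the `probe` callable, and the component identifiers used in the usage note are hypothetical, and the real checks would query the cluster.

```python
import time
from typing import Callable, Iterable, List


def wait_until_healthy(
    probe: Callable[[str], bool],
    components: Iterable[str],
    timeout_seconds: float,
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until every component reports healthy or the deadline passes.

    Returns the components that are still unhealthy (empty list on success).
    """
    deadline = clock() + timeout_seconds
    pending = list(components)
    while True:
        # Keep only the components whose probe still fails.
        pending = [name for name in pending if not probe(name)]
        if not pending or clock() >= deadline:
            return pending
        sleep(interval_seconds)
```

A caller could treat a non-empty result, for example from `wait_until_healthy(probe, ["ucp", "deployment-engine"], timeout_seconds=600)`, as the signal to trigger the rollback step.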
214
188
215
189
### Plan for `rad upgrade kubernetes`
216
190
217
191
The `rad upgrade kubernetes` command will be designed to facilitate the upgrade process for the Radius control plane components. Here is the plan for implementing the `rad upgrade kubernetes` command:
218
192
219
193
1. **Command Structure**: The `rad upgrade kubernetes` command will follow a similar structure to the `rad install kubernetes` command, with additional options for performing upgrades.
220
194
2. **Pre-Upgrade Checks**: The command will perform pre-upgrade checks to ensure compatibility with the existing configuration and identify any potential issues.
221
-
3. **Dry-Run Option**: The command will include a `--dry-run` option to simulate the upgrade process without making any actual changes. This will help users identify potential issues before performing the actual upgrade.
222
-
4. **Upgrade Execution**: The command will execute the upgrade process, including updating the control plane components, applying any necessary database migrations, and updating configurations.
223
-
5. **Post-Upgrade Checks**: The command will perform post-upgrade checks to verify the success of the upgrade and ensure that the system is functioning as expected.
224
-
6. **Rollback Option**: The command will include a rollback option to revert to the previous version in case of any issues during the upgrade process.
195
+
3. **Upgrade Execution**: The command will execute the upgrade process, including updating the control plane components, applying any necessary database migrations, and updating configurations.
196
+
4. **Post-Upgrade Checks**: The command will perform post-upgrade checks to verify the success of the upgrade and ensure that the system is functioning as expected.
197
+
5. **Rollback Option**: The command will include a rollback option to revert to the previous version in case of any issues during the upgrade process.
225
198
226
199
By following this plan, the `rad upgrade kubernetes` command will provide a reliable and user-friendly way to upgrade the Radius control plane components, ensuring compatibility with new versions and minimizing downtime and disruption.