Radius in-place upgrade design document #87

ytimocin · 2025-03-18T21:55:11Z

No description provided.

ytimocin · 2025-04-07T22:08:33Z

architecture/2025-03-upgrade-design-doc.md

+
+- **Simplify upgrade process**: Provide a single CLI command (`rad upgrade kubernetes`) to upgrade Radius without manual reinstallation.
+- **Ensure data safety**: Implement automatic user data backups before upgrades and restore capability if failures occur.
+- **Minimize downtime**: Use rolling upgrades where possible to keep Radius control plane available during the upgrade process.


We can use the Kubernetes rolling upgrade feature to revert the upgrade if there is an issue with the new pods, because the older pods are not deleted until the new ones are healthy.

Wouldn't this require us to have the ability to roll back the database state changes? Otherwise new pods wouldn't be healthy until the change is made, and rolling back pods wouldn't work until the change is reverted.

nicolejms · 2025-04-07T22:13:07Z

architecture/2025-03-upgrade-design-doc.md

+
+### Non goals
+
+- **Downgrade support**: The upgrade process is designed to move forward to newer versions only. Downgrading to previous versions is not supported.


If upgrade is "bad" we need an option to rollback

Do you mean bad upgrade but healthy pods?

I don't agree this is a non-goal. We need the ability to rollback even if the upgrade succeeds. If we introduce a regression, users must be able to rollback to a last known good state without waiting for us to fix the issue and release a new build.

Yes, I added that later in the document. I will remove this non-goal. As discussed with you yesterday, we will introduce a `rad rollback kubernetes` command in one of the future iterations of this work.

lakshmimsft · 2025-04-07T22:14:01Z

architecture/2025-03-upgrade-design-doc.md

+
+2. **Resource requirements**: The Kubernetes cluster has sufficient compute resources (CPU, memory) to run both the existing and new version components during the rolling upgrade process.
+
+3. **CLI version compatibility**: User has a Radius CLI version that includes the `rad upgrade kubernetes` feature. While older CLIs can't perform upgrades, newer CLIs maintain backward compatibility with older control planes.


Can we discuss further if/how we are proposing backward-compatibility?

In one of the future versions, we will keep a list of the most recent successful Radius installations and introduce a new command to roll back to the most recent known-good version.

this can get really thorny if the db schema changes and data has to be rolled back as well

sylvainsf · 2025-04-07T22:15:30Z

architecture/2025-03-upgrade-design-doc.md

+   - Define two new interfaces in the `components/database` package:
+     - `UserDataBackup`: Responsible for creating backups of user data before the upgrade.
+     - `UserDataRestore`: Responsible for restoring data from the backup in case of rollback.
+   - Design versioned backup formats to handle schema migrations between versions.


etcd has snapshots with a lock, but it has no concept of a transaction lock that would let us take a snapshot and perform the upgrade under the same lock, so we would want to block writes at a higher level. In Postgres we can run the change in a transaction with rollback, which removes the need for a backup at all; we could still offer an optional backup, but it seems redundant.

I was also thinking about this option of not adding the backup functionality by trying to solve the problem in the data layer. We can discuss further.
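A minimal sketch of the "solve it in the data layer" idea, assuming an in-memory map stands in for the real store. `Store`, `Migrate`, and `ErrAbort` are hypothetical names for illustration, not part of Radius. The migration runs against a copy while writers are blocked, and the copy replaces the live data only on success, which mimics a transaction with rollback:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAbort is a hypothetical sentinel a migration returns to abandon its changes.
var ErrAbort = errors.New("migration aborted")

// Store is a hypothetical in-memory stand-in for the data store.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

// NewStore copies the initial data so callers cannot mutate it afterwards.
func NewStore(initial map[string]string) *Store {
	data := make(map[string]string, len(initial))
	for k, v := range initial {
		data[k] = v
	}
	return &Store{data: data}
}

// Migrate holds the store lock for the whole migration, so no other writer
// can interleave. The migration runs against a copy; the copy replaces the
// live data only if the migration returns nil.
func (s *Store) Migrate(migrate func(data map[string]string) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	working := make(map[string]string, len(s.data))
	for k, v := range s.data {
		working[k] = v
	}
	if err := migrate(working); err != nil {
		return fmt.Errorf("migration failed, data left unchanged: %w", err)
	}
	s.data = working
	return nil
}

// Get reads a single key under the same lock.
func (s *Store) Get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

func main() {
	store := NewStore(map[string]string{"app/frontend": "v1"})
	err := store.Migrate(func(data map[string]string) error {
		data["app/frontend"] = "v2" // partial change before the failure
		return ErrAbort
	})
	v, _ := store.Get("app/frontend")
	fmt.Println(err, v)
}
```

With this shape a separate backup is only needed for failures that happen after the commit, which is the redundancy concern raised above.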

nicolejms · 2025-04-07T22:17:16Z

architecture/2025-03-upgrade-design-doc.md

+
+1. If custom configuration parameters are invalid (or should it be ignored?)
+
+#### Scenario 3: Handling upgrade failure and recovery


How do we handle rollback after a schema change?

Depending on how we handle the schema changes, we are going to do a rollback. For example, some db migration tools require two files: up.sql and down.sql. We will call down.sql to do the rollback.

There's a distinct difference between a rollback and down migration. Rollback is automatic when the database change fails, down migration is something a user does to undo a database change.

This should be easy to do, for etcd:
https://etcd.io/docs/v3.5/op-guide/recovery/

We just need to take a snapshot and perhaps label it the old version.

Postgres would be similar. We should document both here, as this is distinctly different from a rollback migration. There are some considerations: the restored data will reflect the app graph at the time of the last deployment, so we may want to document that a deployment is required after a rollback.

willdavsmith · 2025-04-07T22:17:39Z

architecture/2025-03-upgrade-design-doc.md

+- **Version skipping limits**: Should we enforce incremental upgrades for certain major version jumps, or always allow direct version skipping?
+- **Upgrade notifications**: How should we notify users clearly about the CLI version mismatch after upgrading the control plane?
+- **Resource constraints**: How do we handle scenarios where the cluster lacks sufficient resources to perform a rolling upgrade?
+


One thing that I'm not seeing mentioned is how we handle breaking changes. We're still in 0.x versions, and breaking changes might sometimes happen in the resource specs or in other places.

There are two kinds of breaking changes, and we can choose not to introduce destructive ones. Changing a column/field from a string to an integer-backed enum is a destructive change. Adding a new column/field and deprecating the old one is non-destructive, but still "breaking" in that a reverted version of the running control plane will not look at the correct info.
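To illustrate the non-destructive option, here is a small sketch in which a newer reader prefers the new field and falls back to the deprecated one. The record shape, the field names `mode` and `modeV2`, and the `legacyModes` mapping are all invented for this example:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourceRecord is a hypothetical stored record. Older versions write Mode;
// newer versions write ModeV2 and leave Mode in place for older readers.
type ResourceRecord struct {
	Mode   string `json:"mode,omitempty"`   // deprecated string field
	ModeV2 *int   `json:"modeV2,omitempty"` // new integer-backed enum
}

// legacyModes maps the deprecated string values to the new enum values.
var legacyModes = map[string]int{"manual": 0, "automatic": 1}

// EffectiveMode prefers the new field and falls back to the deprecated one,
// so records written by either version remain readable.
func EffectiveMode(raw []byte) (int, error) {
	var rec ResourceRecord
	if err := json.Unmarshal(raw, &rec); err != nil {
		return 0, err
	}
	if rec.ModeV2 != nil {
		return *rec.ModeV2, nil
	}
	mode, ok := legacyModes[rec.Mode]
	if !ok {
		return 0, fmt.Errorf("unknown legacy mode %q", rec.Mode)
	}
	return mode, nil
}

func main() {
	mode, err := EffectiveMode([]byte(`{"mode":"automatic"}`))
	fmt.Println(mode, err)
}
```

The reverse direction is the problem described above: an older control plane reads only `mode`, so values written solely to `modeV2` are invisible to it after a revert.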

nicolejms · 2025-04-07T22:22:07Z

architecture/2025-03-upgrade-design-doc.md

+}
+```
+
+This interface will be implemented (or existing will be improved) to handle version comparisons, prevent downgrades, and resolve the "latest" version tag to a specific version number. The implementation will (probably) connect to the GitHub API to fetch available release versions when needed.


How will we track information about "compatibility" between versions? Also, we need to support rollback even if the upgrade completed successfully.

Added that in the development plan
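For reference, the version comparison and downgrade check described in the quoted paragraph could look roughly like the sketch below. It assumes versions are always written as `vMAJOR.MINOR.PATCH`; the function names are hypothetical, resolving the "latest" tag is left out, and rejecting an upgrade to the same version is an assumption not stated in the document:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "v0.44.0" into its three numeric components.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q: want vMAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q: bad component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// CompareVersions returns -1, 0, or 1 when a is lower than, equal to, or
// higher than b. Components are compared numerically, so v0.9.0 < v0.44.0.
func CompareVersions(a, b string) (int, error) {
	va, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	vb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	for i := range va {
		if va[i] < vb[i] {
			return -1, nil
		}
		if va[i] > vb[i] {
			return 1, nil
		}
	}
	return 0, nil
}

// ValidateUpgradeTarget rejects downgrades and, by assumption, no-op upgrades.
func ValidateUpgradeTarget(current, target string) error {
	c, err := CompareVersions(target, current)
	if err != nil {
		return err
	}
	if c < 0 {
		return fmt.Errorf("downgrade from %s to %s is not supported", current, target)
	}
	if c == 0 {
		return fmt.Errorf("already running %s", current)
	}
	return nil
}

func main() {
	fmt.Println(ValidateUpgradeTarget("v0.42.0", "v0.44.0"))
	fmt.Println(ValidateUpgradeTarget("v0.44.0", "v0.42.0"))
}
```

This covers ordering only. It does not answer the compatibility question above, which needs extra metadata per release.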

willtsai · 2025-04-07T22:22:52Z

Would upgrades via GitOps trigger this upgrade flow? I think it should :)

ytimocin · 2025-04-07T22:24:29Z

Would upgrades via GitOps trigger this upgrade flow? I think it should :)

I can use @willdavsmith 's help on this question :)

nicolejms · 2025-04-07T22:25:28Z

architecture/2025-03-upgrade-design-doc.md

+
+Checks will include:
+
+1. Version compatibility verification


wouldn't we also want to validate permissions, resources, etc (i.e. all your assumptions are true)?

nicolejms · 2025-04-07T22:30:09Z

architecture/2025-03-upgrade-design-doc.md

+2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts
+3. **Automatic Rollback**: Failed upgrades trigger automatic restoration of previous state
+4. **Detailed Error Reporting**: Clear error messages with troubleshooting guidance
+5. **Idempotent Operations**: Commands can be safely retried after addressing issues


does this mean we'll rollback on every failure or that the system can have varying degrees of upgrade completed?

nicolejms · 2025-04-07T22:31:00Z

architecture/2025-03-upgrade-design-doc.md

+The upgrade process will implement the following error handling strategies:
+
+1. **Pre-flight Validation**: Catch incompatibility issues before starting the upgrade
+2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts


how do we deal with timeouts? always rollback?

kachawla · 2025-04-07T22:07:58Z

architecture/2025-03-upgrade-design-doc.md

+
+1. **User permissions**: Users running the upgrade command have enough permissions on both the Kubernetes cluster and the Radius installation.
+
+2. **Resource requirements**: The Kubernetes cluster has sufficient compute resources (CPU, memory) to run both the existing and new version components during the rolling upgrade process.


What's the expected user experience during upgrade if resources are constrained?

Good point. Users may hit mid-upgrade failures with no clear information. To fix this, we can:

1. Update the Assumptions section (in the docs or somewhere else users can see it) to call out the minimum requirements for the `rad upgrade kubernetes` process.
2. Add a ResourceAvailability pre-flight check that measures the resources in the cluster and aborts the process if there are not enough.
3. Improve the CLI UX so that, on failure, it states that there are not enough resources.

We can apply all of these steps, and I will update the doc accordingly.
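A rough sketch of how a ResourceAvailability pre-flight check might plug into a list of checks. The `PreflightCheck` interface, the `Resources` type, and the numbers are assumptions for illustration; a real check would read free capacity from the Kubernetes API rather than take it as a field:

```go
package main

import "fmt"

// Resources is a hypothetical summary of cluster capacity.
type Resources struct {
	CPUMillis int64
	MemoryMiB int64
}

// PreflightCheck is a hypothetical interface for a single pre-flight check.
type PreflightCheck interface {
	Name() string
	Run() error
}

// ResourceAvailability fails when free capacity is below what running the
// old and new components side by side is assumed to need.
type ResourceAvailability struct {
	Free     Resources
	Required Resources
}

func (c ResourceAvailability) Name() string { return "ResourceAvailability" }

func (c ResourceAvailability) Run() error {
	if c.Free.CPUMillis < c.Required.CPUMillis {
		return fmt.Errorf("insufficient CPU: %dm free, %dm required",
			c.Free.CPUMillis, c.Required.CPUMillis)
	}
	if c.Free.MemoryMiB < c.Required.MemoryMiB {
		return fmt.Errorf("insufficient memory: %dMiB free, %dMiB required",
			c.Free.MemoryMiB, c.Required.MemoryMiB)
	}
	return nil
}

// RunPreflight stops at the first failing check, before any change is made.
func RunPreflight(checks []PreflightCheck) error {
	for _, check := range checks {
		if err := check.Run(); err != nil {
			return fmt.Errorf("pre-flight check %s failed: %w", check.Name(), err)
		}
	}
	return nil
}

func main() {
	err := RunPreflight([]PreflightCheck{
		ResourceAvailability{
			Free:     Resources{CPUMillis: 500, MemoryMiB: 2048},
			Required: Resources{CPUMillis: 1000, MemoryMiB: 1024},
		},
	})
	fmt.Println(err)
}
```

The wrapped error names the failing check, which is what the CLI would surface to the user.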

kachawla · 2025-04-07T22:10:33Z

architecture/2025-03-upgrade-design-doc.md

+
+   Example: If you have CLI v0.42 and Control Plane v0.42:
+
+   - You CAN upgrade to Control Plane v0.44 using the v0.42 CLI (if v0.42 CLI includes the upgrade feature)


Are there any edge cases that we should think about with this approach? A lot of the operations are currently performed on the client side, so I wonder if this would break in ways we haven't thought through.

kachawla · 2025-04-07T22:20:01Z

architecture/2025-03-upgrade-design-doc.md

+**Exceptions:**
+
+1. If version check fails because the target is lower than current version
+2. If components fail health checks after upgrade


How does a user recover from this?

Good question. If the upgrade completes successfully and components start misbehaving after `rad upgrade kubernetes` finishes, users can recover with the new command we will introduce, `rad rollback kubernetes`, which will roll back to the most recent successful version.

kachawla · 2025-04-07T22:20:56Z

architecture/2025-03-upgrade-design-doc.md

+
+**Exceptions:**
+
+1. If custom configuration parameters are invalid (or should it be ignored?)


What's the benefit of ignoring invalid config?

I had added that as a note to myself. After playing around with the code, I think we shouldn't ignore invalid config; we should just error out.

kachawla · 2025-04-07T22:24:39Z

architecture/2025-03-upgrade-design-doc.md

+
+**Exceptions:**
+
+1. If the user data backup restoration fails (rare but possible)


What's the path to recovery from this state?

There may be 3 possible ways to recover from this state:

- Manual rollback
- Troubleshoot and retry

Please keep in mind that the user data will still be available in the data store. We are only talking about a failure to restore the backup, not a loss of the actual data.

kachawla · 2025-04-07T22:25:30Z

architecture/2025-03-upgrade-design-doc.md

+**Exceptions:**
+
+1. If direct upgrade path isn't supported between versions
+1. If database migrations encounter issues (this may be the case when we introduce Postgres as the data store)


Why is this specific to postgres?

What I am talking about here is that we will have a migrations folder that will have all of our migrations (up and down files). For example: add_resources_table.up, add_resources_table.down, add_new_column_to_resources_table.up, and add_new_column_to_resources_table.down etc. We will need to find a way to do this in etcd.
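A sketch of the up/down pairing described above, reusing the migration names from the comment. The schema is modeled as a set of names in a map, which is a simplification; `Migration` and `MigrateTo` are hypothetical names, and a "version" here is just the count of applied migrations:

```go
package main

import "fmt"

// Migration pairs an up step with the down step that undoes it.
type Migration struct {
	Name string
	Up   func(schema map[string]bool)
	Down func(schema map[string]bool)
}

// exampleMigrations mirrors the file names mentioned in the comment.
func exampleMigrations() []Migration {
	return []Migration{
		{
			Name: "add_resources_table",
			Up:   func(s map[string]bool) { s["resources"] = true },
			Down: func(s map[string]bool) { delete(s, "resources") },
		},
		{
			Name: "add_new_column_to_resources_table",
			Up:   func(s map[string]bool) { s["resources.new_column"] = true },
			Down: func(s map[string]bool) { delete(s, "resources.new_column") },
		},
	}
}

// MigrateTo moves the schema between versions, where a version is the number
// of migrations applied. Ups run in order; downs run in reverse order.
func MigrateTo(schema map[string]bool, migrations []Migration, from, to int) error {
	if from < 0 || to < 0 || from > len(migrations) || to > len(migrations) {
		return fmt.Errorf("version out of range: from=%d to=%d", from, to)
	}
	for v := from; v < to; v++ {
		migrations[v].Up(schema)
	}
	for v := from; v > to; v-- {
		migrations[v-1].Down(schema)
	}
	return nil
}

func main() {
	schema := map[string]bool{}
	migrations := exampleMigrations()
	fmt.Println(MigrateTo(schema, migrations, 0, 2), len(schema))
	fmt.Println(MigrateTo(schema, migrations, 2, 1), len(schema))
}
```

Nothing in this shape is specific to SQL, so the same ordered up/down list could in principle drive key rewrites in etcd, though that remains the open question.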

kachawla · 2025-04-07T22:27:20Z

architecture/2025-03-upgrade-design-doc.md

+  CLI -->|Creates User Data Backup| Backup["User Data Backup"]
+  Backup -->|Restores on Failure| Restore["Restore Mechanism"]


These should happen on the server side, client doesn't store the data today.

kachawla · 2025-04-07T22:27:29Z

architecture/2025-03-upgrade-design-doc.md

+  CLI -->|Logs Progress| User["User"]
+  CLI -->|Performs Pre-flight Checks| PreFlight["Pre-flight Checks"]
+  PreFlight -->|Validates| KubernetesAPI
+  CLI -->|Creates User Data Backup| Backup["User Data Backup"]


Where is the backup stored?

Kubernetes objects may be too platform-specific, but for now I believe we are going to keep the backups in a Kubernetes object such as a PVC (https://kubernetes.io/docs/concepts/storage/persistent-volumes/).

kachawla · 2025-04-07T22:45:14Z

architecture/2025-03-upgrade-design-doc.md

+
+We can utilize Kubernetes Lease objects (coordination.k8s.io/v1) for implementing the distributed locking mechanism (open to discussion and suggestions). Leases are purpose-built for this use case, providing built-in lease duration and automatic expiration capabilities. For more information, see: <https://kubernetes.io/docs/concepts/architecture/leases/>.
+
+Other CLI commands (`rad deploy app.bicep`, `rad delete app my-app` or other data-changing commands) that modify data will check for this lock before proceeding:


Locks should be enforced on the server side to prevent race conditions, as client-side locking may not effectively handle concurrent requests where client mutations and upgrades overlap.

It is going to be introduced in the data store layer.
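A minimal sketch of what enforcing the lock in the data store layer could mean, with an in-memory map standing in for the store. `LockedStore` and `ErrUpgradeInProgress` are hypothetical names. The point is that the lock check and the write happen under the same mutex, so no client can slip a write in between them:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUpgradeInProgress is returned for writes attempted during an upgrade.
var ErrUpgradeInProgress = errors.New("upgrade in progress: writes are blocked")

// LockedStore is a hypothetical store wrapper that owns the upgrade lock.
type LockedStore struct {
	mu        sync.Mutex
	upgrading bool
	data      map[string]string
}

func NewLockedStore() *LockedStore {
	return &LockedStore{data: make(map[string]string)}
}

// BeginUpgrade takes the upgrade lock, or fails if another upgrade holds it.
func (s *LockedStore) BeginUpgrade() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.upgrading {
		return errors.New("another upgrade is already running")
	}
	s.upgrading = true
	return nil
}

// EndUpgrade releases the upgrade lock.
func (s *LockedStore) EndUpgrade() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.upgrading = false
}

// Put checks the lock and writes under one mutex, so the check cannot go
// stale between the two steps the way a client-side check can.
func (s *LockedStore) Put(key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.upgrading {
		return ErrUpgradeInProgress
	}
	s.data[key] = value
	return nil
}

func main() {
	store := NewLockedStore()
	fmt.Println(store.Put("app/frontend", "v1"))
	fmt.Println(store.BeginUpgrade())
	fmt.Println(store.Put("app/frontend", "v2"))
	store.EndUpgrade()
	fmt.Println(store.Put("app/frontend", "v2"))
}
```

With this arrangement, commands such as `rad deploy` do not need to check anything themselves; they only need to handle the rejection.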

willtsai · 2025-04-17T21:11:56Z

architecture/2025-03-upgrade-design-doc.md

+
+### Non goals
+
+- **Downgrade support**: The upgrade process is designed to move forward to newer versions only. Downgrading to previous versions is not supported.


one of the use cases I've heard is that an upgrade may go successfully from a Radius control plane perspective, but then subsequently result in test failures in other parts of the stack (e.g. if Radius is built into a developer platform that consists of many components downstream from Radius). how will we address or advise users for this?

In one of the future iterations, we will add a new command to rollback to the most recent successful version of Radius. A command like rad rollback kubernetes. Would that answer your question?

willtsai · 2025-04-17T21:15:07Z

architecture/2025-03-upgrade-design-doc.md

+   - An air-gapped environment is one where systems are physically isolated from unsecured networks like the public internet.
+   - These environments are common in high-security scenarios (military, financial, healthcare, government) where external network connectivity is restricted.


what would be our stance for upgrading Radius in an air-gapped environment? will there be an option to point a radius upgrade to use container images from predownloaded or custom registries?

Yes, exactly. We will be responsible for adding all the necessary parameters to the necessary commands so that users can point to their own registries and/or specific images.

Like the work we have been doing here: radius-project/radius#9189. And this is the sh file to run rad install w/o internet access: https://gist.github.com/ytimocin/8887d95ab1409562f4646fd30edb101c.

willtsai · 2025-04-17T21:20:35Z

architecture/2025-03-upgrade-design-doc.md

+
+### Goals
+
+- **Simplify upgrade process**: Provide a single CLI command (`rad upgrade kubernetes`) to upgrade Radius without manual reinstallation.


Will there be an option to upgrade via GitOps if the user manages their Radius installation via Flux?

To be honest, I haven't explored this option. If GitOps uses Helm upgrade behind the scenes, it may not be that difficult to add. I will discuss this with @willdavsmith and decide what to do next.

brooke-hamilton

🚀 Nice document 🚀

brooke-hamilton · 2025-04-17T21:10:46Z

architecture/2025-03-upgrade-design-doc.md

+
+### High-Level Design Diagram
+
+```mermaid


This diagram appears to indicate that the user data backup runs in parallel to preflight checks but does not have to complete before the helm upgrade is applied. Is that the intended flow?

No, that is not right. I will update it.

brooke-hamilton · 2025-04-17T21:16:03Z

architecture/2025-03-upgrade-design-doc.md

+
+### Architecture Diagram
+
+```mermaid


Could this and the previous diagram be merged into one component diagram that shows relationships and responsibilities?

I will try to do that too. I am a beginner in using MermaidJS.

brooke-hamilton · 2025-04-17T21:16:19Z

architecture/2025-03-upgrade-design-doc.md

+
+### Detailed Design
+
+```mermaid


This is great - very clear process.

brooke-hamilton · 2025-04-17T21:17:49Z

architecture/2025-03-upgrade-design-doc.md

+}
+```
+
+This interface will be implemented (or existing will be improved) to handle version comparisons, prevent downgrades, and resolve the "latest" version tag to a specific version number. The implementation will (probably) connect to the GitHub API to fetch available release versions when needed.


We need a way to check the version in an airgapped environment without the GitHub API. Maybe use the currently configured OCI container repo?

We can list the Helm releases using HelmClient in go and get the deployed release version from that.

lakshmimsft · 2025-04-17T21:27:11Z

architecture/2025-03-upgrade-design-doc.md

+```go
+type UserDataBackup interface {
+    // Creates a backup of all user application metadata and configurations
+    BackupUserData(ctx context.Context) (BackupID string, err error)


Will there be a pattern/format for this ID? Does it tie to a specific backup run, and do we need to identify it later?

superbeeny · 2025-04-17T21:15:35Z

architecture/2025-03-upgrade-design-doc.md

+
+   Example: If you have CLI v0.42 and Control Plane v0.42:
+
+   - You CAN upgrade to Control Plane v0.44 using the v0.42 CLI (if v0.42 CLI includes the upgrade feature)


Can we update the CLI as part of the process?

superbeeny · 2025-04-17T21:17:51Z

architecture/2025-03-upgrade-design-doc.md

+3. **Lock Mechanism**: Data-store-level distributed locking system
+4. **Backup/Restore**: User data protection system using ConfigMaps/PVs
+5. **Helm Integration**: Enhanced wrapper around Helm's upgrade capabilities
+6. **Health Verification**: Component readiness and health check mechanisms


Are we thinking beyond health checks when validating that the upgrade was a success? How are we thinking about validating data after schema changes?

superbeeny · 2025-04-17T21:20:06Z

architecture/2025-03-upgrade-design-doc.md

+Initiating Radius upgrade from v0.40.0 to v0.44.0 (latest stable)...
+Pre-flight checks:
+  ✓ Valid version target
+  ✓ Multiple version jump detected (v0.40.0 → v0.44.0)


would the simplest option here be applying each release in turn?

superbeeny · 2025-04-17T21:23:38Z

architecture/2025-03-upgrade-design-doc.md

+
+2. **Migration Plan Bundles**
+
+   - Generate a composite plan when skipping (e.g. v0.42 → v0.45):  


Similar to my comment above, does it make sense to apply each of the versions in turn until the requested version is reached?
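The "apply each version in turn" option can be sketched as a path computation over an ordered release list. `UpgradePath` is a hypothetical helper, and the release list is passed in rather than fetched from anywhere:

```go
package main

import "fmt"

// UpgradePath returns every release after current up to and including
// target, in order. releases must be sorted from oldest to newest.
func UpgradePath(releases []string, current, target string) ([]string, error) {
	from, to := -1, -1
	for i, r := range releases {
		if r == current {
			from = i
		}
		if r == target {
			to = i
		}
	}
	if from == -1 {
		return nil, fmt.Errorf("current version %s is not a known release", current)
	}
	if to == -1 {
		return nil, fmt.Errorf("target version %s is not a known release", target)
	}
	if to <= from {
		return nil, fmt.Errorf("target %s is not newer than current %s", target, current)
	}
	path := make([]string, to-from)
	copy(path, releases[from+1:to+1])
	return path, nil
}

func main() {
	releases := []string{"v0.40.0", "v0.41.0", "v0.42.0", "v0.43.0", "v0.44.0"}
	path, err := UpgradePath(releases, "v0.40.0", "v0.44.0")
	fmt.Println(path, err)
}
```

Printing this path without acting on it is also roughly what a `--dry-run` mode would show for a multi-version jump.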

superbeeny · 2025-04-17T21:24:26Z

architecture/2025-03-upgrade-design-doc.md

+3. **User Confirmation & Dry-Run**
+
+   - Prompt the user with a clear “You're jumping from A→D. We'll run migrations for B and C in turn. Proceed?”
+   - Offer a `--dry-run` mode that prints the full step list without making changes.


Note: this will only be able to model the configuration changes and not necessarily data changes

kachawla · 2025-04-17T21:14:20Z

architecture/2025-03-upgrade-design-doc.md

+**Stale‑lock detection:** each lock has a TTL/heartbeat; expired leases are auto‑cleaned before AcquireLock.
+Force cleanup: --force flag allows manual removal of stale/orphaned locks.
+
+Usage in CLI commands:


#87 (comment) would still be an issue if we are checking the lock in the CLI.

kachawla · 2025-04-17T21:15:13Z

architecture/2025-03-upgrade-design-doc.md

+
+**Pre-flight Check System:**
+
+Pre-flight checks run before any changes are made to ensure the upgrade can proceed safely.


Where will these checks run?

kachawla · 2025-04-17T21:15:23Z

architecture/2025-03-upgrade-design-doc.md

+
+1. Version compatibility verification
+2. Existing installation detection
+3. Database connectivity


What do we mean by this?

kachawla · 2025-04-17T21:15:40Z

architecture/2025-03-upgrade-design-doc.md

+1. Version compatibility verification
+2. Existing installation detection
+3. Database connectivity
+4. Custom configuration validation


Can you expand a bit on what is custom configuration?

kachawla · 2025-04-17T21:16:47Z

architecture/2025-03-upgrade-design-doc.md

+Rather than taking complete snapshots of the underlying databases (etcd/PostgreSQL), we'll implement a more targeted approach that backs up only the user application metadata and configuration that Radius manages:
+
+- **Included in backup**: User application, environment, recipe definitions, and all other resources that the user has deployed/added via Radius.
+- **Not included in backup**: Anything other than user data in the data store.


Can you share an example of data in the store that's not user data?

kachawla · 2025-04-17T21:22:04Z

architecture/2025-03-upgrade-design-doc.md

+    Values map[string]interface{} // Custom configuration values
+    Timeout time.Duration // Maximum time allowed for upgrade
+
+    EnableUserDataBackup  bool // Whether automatic user data backup is enabled


Could you share a scenario when this will be disabled?

kachawla · 2025-04-17T21:25:45Z

architecture/2025-03-upgrade-design-doc.md

+
+### API design (if applicable)
+
+No specific REST API addition is necessary.


design-notes/architecture/2025-03-upgrade-design-doc.md

Lines 462 to 468 in 9cd00ec

UpgradeRadius(ctx context.Context, options UpgradeOptions) error

// Returns the current status of an ongoing upgrade

GetUpgradeStatus(ctx context.Context) (UpgradeStatus, error)

// Validates that an upgrade to the target version is possible

ValidateUpgrade(ctx context.Context, targetVersion string) error

I gathered that these will be new APIs, did I misunderstand it?

superbeeny · 2025-04-17T21:37:45Z

Nice, very comprehensive and lots of food for thought

Signed-off-by: ytimocin <ytimocin@microsoft.com>

ytimocin force-pushed the ytimocin/design/upgrades branch 3 times, most recently from 3e7f609 to ad7830d Compare March 19, 2025 00:42

ytimocin force-pushed the ytimocin/design/upgrades branch from ad7830d to f3bb521 Compare March 26, 2025 23:33

ytimocin force-pushed the ytimocin/design/upgrades branch 2 times, most recently from aac2eba to 75f640a Compare April 7, 2025 18:24

ytimocin marked this pull request as ready for review April 7, 2025 19:16

ytimocin requested review from a team as code owners April 7, 2025 19:16

ytimocin force-pushed the ytimocin/design/upgrades branch from 75f640a to eb45973 Compare April 7, 2025 21:54

ytimocin commented Apr 7, 2025

sylvainsf, willdavsmith, nicolejms, lakshmimsft, willtsai, and kachawla reviewed Apr 7, 2025

ytimocin force-pushed the ytimocin/design/upgrades branch 3 times, most recently from 312b31d to 9cd00ec Compare April 17, 2025 21:09

willtsai, brooke-hamilton, lakshmimsft, superbeeny, and kachawla reviewed Apr 17, 2025

ytimocin added 5 commits May 12, 2025 14:21

Feature specification for in-place upgrade of Radius

073a05d

Signed-off-by: ytimocin <ytimocin@microsoft.com>

Radius in-place upgrade design document

5e702e6

Signed-off-by: ytimocin <ytimocin@microsoft.com>

Addressing feedback from the team

f443eb8

Signed-off-by: ytimocin <ytimocin@microsoft.com>

Making some more changes and additions

9b614e1

Signed-off-by: ytimocin <ytimocin@microsoft.com>

Address feedback

748836b

Signed-off-by: ytimocin <ytimocin@microsoft.com>

ytimocin force-pushed the ytimocin/design/upgrades branch from bd857f2 to 748836b Compare May 12, 2025 21:21

Addressing feedback

e27279a

Signed-off-by: ytimocin <ytimocin@microsoft.com>

ytimocin force-pushed the ytimocin/design/upgrades branch from 2697578 to e27279a Compare May 12, 2025 21:48

ytimocin mentioned this pull request Jun 16, 2025

Adding Config and Helm Preflight checks radius-project/radius#9741

Merged

12 tasks


