Radius in-place upgrade design document #87

Open
wants to merge 6 commits into
base: main

Conversation

@ytimocin ytimocin commented Mar 18, 2025

No description provided.

@ytimocin ytimocin force-pushed the ytimocin/design/upgrades branch 3 times, most recently from 3e7f609 to ad7830d Compare March 19, 2025 00:42
@ytimocin ytimocin force-pushed the ytimocin/design/upgrades branch from ad7830d to f3bb521 Compare March 26, 2025 23:33
@ytimocin ytimocin force-pushed the ytimocin/design/upgrades branch 2 times, most recently from aac2eba to 75f640a Compare April 7, 2025 18:24
@ytimocin ytimocin marked this pull request as ready for review April 7, 2025 19:16
@ytimocin ytimocin requested review from a team as code owners April 7, 2025 19:16
@ytimocin ytimocin force-pushed the ytimocin/design/upgrades branch from 75f640a to eb45973 Compare April 7, 2025 21:54

- **Simplify upgrade process**: Provide a single CLI command (`rad upgrade kubernetes`) to upgrade Radius without manual reinstallation.
- **Ensure data safety**: Implement automatic user data backups before upgrades and restore capability if failures occur.
- **Minimize downtime**: Use rolling upgrades where possible to keep Radius control plane available during the upgrade process.

Contributor Author

We can use the rolling upgrade feature of Kubernetes to revert the upgrade if there is an issue with the new pods, because the old pods are not deleted until the new ones are healthy.

Wouldn't this require us to have the ability to roll back the database state changes? Otherwise new pods wouldn't be healthy until the change is made, and rolling back pods wouldn't work until the change is reverted.


### Non goals

- **Downgrade support**: The upgrade process is designed to move forward to newer versions only. Downgrading to previous versions is not supported.

If an upgrade is "bad", we need an option to roll back.

Contributor Author

Do you mean a bad upgrade but healthy pods?

@nicolejms nicolejms Apr 17, 2025

I don't agree this is a non-goal. We need the ability to roll back even if the upgrade succeeds. If we introduce a regression, users must be able to roll back to the last known good state without waiting for us to fix the issue and release a new build.

Contributor Author

Yes, I added that later in the document. I will remove this non-goal. As discussed with you yesterday, we will introduce a `rad rollback kubernetes` command in one of the future iterations of this work.


2. **Resource requirements**: The Kubernetes cluster has sufficient compute resources (CPU, memory) to run both the existing and new version components during the rolling upgrade process.

3. **CLI version compatibility**: User has a Radius CLI version that includes the `rad upgrade kubernetes` feature. While older CLIs can't perform upgrades, newer CLIs maintain backward compatibility with older control planes.

Contributor

Can we discuss further if/how we are proposing backward-compatibility?

Contributor Author

In a future version, we will keep a list of the most recent successful Radius installations and introduce a new command to roll back to the most recent successful version.

Contributor

This can get really thorny if the DB schema changes and data has to be rolled back as well.

- Define two new interfaces in the `components/database` package:
- `UserDataBackup`: Responsible for creating backups of user data before the upgrade.
- `UserDataRestore`: Responsible for restoring data from the backup in case of rollback.
- Design versioned backup formats to handle schema migrations between versions.

etcd has snapshots with a lock, but it has no concept of a transaction lock that would let us take a snapshot and perform the upgrade under the same lock, so we would want to block writes at a higher level. In Postgres, we can run the change in a transaction with rollback, which removes the need for a backup at all. We could still enable an optional backup, but it seems redundant.

Contributor Author

I was also considering this option: skip the backup functionality and solve the problem in the data layer instead. We can discuss further.
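For readers following this thread, here is a minimal Go sketch of how the `UserDataBackup` and `UserDataRestore` interfaces quoted above might fit together. Only `BackupUserData` and its signature come from the design document; `RestoreUserData`, the `memoryStore` type, and the `backup-N` ID format are assumptions made for illustration, and the in-memory map merely stands in for etcd or Postgres.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// UserDataBackup mirrors the method shown in the design document.
type UserDataBackup interface {
	BackupUserData(ctx context.Context) (BackupID string, err error)
}

// UserDataRestore is named in the document, but this method signature is
// an assumption made for illustration.
type UserDataRestore interface {
	RestoreUserData(ctx context.Context, backupID string) error
}

// memoryStore is a hypothetical stand-in for the real data store.
type memoryStore struct {
	mu      sync.Mutex
	data    map[string]string            // live user data
	backups map[string]map[string]string // backup ID -> copied data
	nextID  int
}

// Compile-time checks that memoryStore satisfies both interfaces.
var _ UserDataBackup = (*memoryStore)(nil)
var _ UserDataRestore = (*memoryStore)(nil)

func newMemoryStore() *memoryStore {
	return &memoryStore{data: map[string]string{}, backups: map[string]map[string]string{}}
}

// BackupUserData copies the live data and returns an ID for the copy.
func (s *memoryStore) BackupUserData(ctx context.Context) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nextID++
	id := fmt.Sprintf("backup-%d", s.nextID) // hypothetical ID format
	snapshot := make(map[string]string, len(s.data))
	for k, v := range s.data {
		snapshot[k] = v
	}
	s.backups[id] = snapshot
	return id, nil
}

// RestoreUserData replaces the live data with the named backup.
func (s *memoryStore) RestoreUserData(ctx context.Context, backupID string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	snapshot, ok := s.backups[backupID]
	if !ok {
		return errors.New("unknown backup ID: " + backupID)
	}
	s.data = make(map[string]string, len(snapshot))
	for k, v := range snapshot {
		s.data[k] = v
	}
	return nil
}

// demo backs up, simulates a bad upgrade, then restores.
func demo() (string, error) {
	ctx := context.Background()
	store := newMemoryStore()
	store.data["app/frontend"] = "v1"
	id, err := store.BackupUserData(ctx)
	if err != nil {
		return "", err
	}
	store.data["app/frontend"] = "corrupted-by-upgrade"
	if err := store.RestoreUserData(ctx, id); err != nil {
		return "", err
	}
	return store.data["app/frontend"], nil
}

func main() {
	value, err := demo()
	fmt.Println(value, err)
}
```

The sketch does not address the locking concern raised in this thread: nothing here stops a write from landing between the backup and the upgrade.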


1. If custom configuration parameters are invalid (or should it be ignored?)

#### Scenario 3: Handling upgrade failure and recovery

How do we handle rollback after a schema change?

Contributor Author

Depending on how we handle the schema changes, we are going to do a rollback. For example, some DB migration tools require two files: `up.sql` and `down.sql`. We will call `down.sql` to do the rollback.

There's a distinct difference between a rollback and a down migration. A rollback is automatic when the database change fails; a down migration is something a user runs to undo a database change.

This should be easy to do. For etcd:
https://etcd.io/docs/v3.5/op-guide/recovery/

We just need to take a snapshot and perhaps label it with the old version.

Postgres would be similar. We should document them here, as this is distinctly different from a rollback migration. There are some considerations: the snapshot will reflect the app graph at the time of the last deployment, so we may want to document that a deployment is required after rollback.
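To make the rollback versus down-migration distinction in this thread concrete, here is a small Go sketch. It is an in-memory illustration only: the `migration` type, `applyAll`, and `revertLast` are hypothetical names, and a real Postgres implementation would more likely rely on a database transaction for the automatic case. The migration names are the examples mentioned elsewhere in this review.

```go
package main

import (
	"errors"
	"fmt"
)

// migration pairs a forward step with its inverse, like up.sql and down.sql.
type migration struct {
	name string
	up   func(schema map[string]bool) error
	down func(schema map[string]bool) error
}

// applyAll runs each up step in order. If one fails, it undoes the steps
// that already succeeded, newest first. This is the automatic rollback case.
func applyAll(schema map[string]bool, steps []migration) error {
	for i, m := range steps {
		if err := m.up(schema); err != nil {
			for j := i - 1; j >= 0; j-- {
				if derr := steps[j].down(schema); derr != nil {
					return fmt.Errorf("rollback of %s failed: %w", steps[j].name, derr)
				}
			}
			return fmt.Errorf("migration %s failed, changes rolled back: %w", m.name, err)
		}
	}
	return nil
}

// revertLast is the user-initiated case: a down migration run on purpose
// after the upgrade itself succeeded.
func revertLast(schema map[string]bool, steps []migration) error {
	if len(steps) == 0 {
		return errors.New("no migrations to revert")
	}
	return steps[len(steps)-1].down(schema)
}

// addFlag builds a migration that sets and clears one schema marker.
func addFlag(name string) migration {
	return migration{
		name: name,
		up:   func(s map[string]bool) error { s[name] = true; return nil },
		down: func(s map[string]bool) error { delete(s, name); return nil },
	}
}

// failing builds a migration whose up step always errors.
func failing(name string) migration {
	return migration{
		name: name,
		up:   func(map[string]bool) error { return errors.New("boom") },
		down: func(map[string]bool) error { return nil },
	}
}

// demo returns the schema size after a failed run and after a manual revert.
func demo() (afterFailure int, afterRevert int) {
	schema := map[string]bool{}
	_ = applyAll(schema, []migration{addFlag("add_resources_table"), failing("bad_step")})
	afterFailure = len(schema)

	good := []migration{addFlag("add_resources_table"), addFlag("add_new_column_to_resources_table")}
	_ = applyAll(schema, good)
	_ = revertLast(schema, good)
	afterRevert = len(schema)
	return afterFailure, afterRevert
}

func main() {
	a, b := demo()
	fmt.Println(a, b)
}
```

The snapshot-based recovery described in the last comment is a third path and is not modeled here.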

- **Version skipping limits**: Should we enforce incremental upgrades for certain major version jumps, or always allow direct version skipping?
- **Upgrade notifications**: How should we notify users clearly about the CLI version mismatch after upgrading the control plane?
- **Resource constraints**: How do we handle scenarios where the cluster lacks sufficient resources to perform a rolling upgrade?

Contributor

One thing that I'm not seeing mentioned is how we handle breaking changes. We're still in 0.x versions, and breaking changes might sometimes happen in the resource specs or in other places.

There are two kinds of breaking changes, and we can choose not to introduce destructive ones. Changing a column/field from a string to an integer-backed enum is a destructive change. Using a new column/field and deprecating the old one is non-destructive, but still "breaking" in that a control plane reverted to the previous version will not look at the correct info.
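As an illustration of the non-destructive case, the Go sketch below adds a new integer-backed field next to the old string field. Every name here (`recipeRecord`, `Mode`, `ModeCode`) is hypothetical. Writing both fields is one possible mitigation for the revert problem described above, not something the design document commits to.

```go
package main

import "fmt"

// recipeRecord is a hypothetical stored document. Mode is the old string
// field; ModeCode is the new integer-backed field added alongside it.
type recipeRecord struct {
	Mode     string // deprecated, still written for older control planes
	ModeCode *int   // new field; nil when the record predates the change
}

const (
	modeManual = 0
	modeAuto   = 1
)

// readMode prefers the new field and falls back to the old one, so records
// written before the change remain readable.
func readMode(r recipeRecord) int {
	if r.ModeCode != nil {
		return *r.ModeCode
	}
	if r.Mode == "auto" {
		return modeAuto
	}
	return modeManual
}

// writeMode populates both fields so that a control plane reverted to the
// previous version still finds a correct value in the old field.
func writeMode(code int) recipeRecord {
	name := "manual"
	if code == modeAuto {
		name = "auto"
	}
	return recipeRecord{Mode: name, ModeCode: &code}
}

func main() {
	old := recipeRecord{Mode: "auto"} // written by the previous version
	fmt.Println(readMode(old), readMode(writeMode(modeManual)), writeMode(modeAuto).Mode)
}
```

If the new version wrote only `ModeCode`, a reverted control plane reading `Mode` would see an empty string, which is the "breaking" behavior the comment warns about.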

}
```

This interface will be implemented (or existing will be improved) to handle version comparisons, prevent downgrades, and resolve the "latest" version tag to a specific version number. The implementation will (probably) connect to the GitHub API to fetch available release versions when needed.

@nicolejms nicolejms Apr 7, 2025

How will we track information about "compatibility" between versions? Also, we need to support rollback even if the upgrade completed successfully.

Contributor Author

Added that in the development plan
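The version comparison and downgrade prevention described in the quoted paragraph could look roughly like the Go sketch below. It assumes plain `vMAJOR.MINOR.PATCH` strings and ignores pre-release tags and the "latest" lookup; `parseVersion`, `compare`, and `validateUpgrade` are illustrative names, not the interface from the design document.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type version struct{ major, minor, patch int }

// parseVersion accepts strings such as "v0.42.0" or "0.42.0".
func parseVersion(s string) (version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return version{}, fmt.Errorf("invalid version %q: want vMAJOR.MINOR.PATCH", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return version{}, fmt.Errorf("invalid version %q: bad component %q", s, p)
		}
		nums[i] = n
	}
	return version{nums[0], nums[1], nums[2]}, nil
}

// compare returns -1, 0, or 1 when a is older than, equal to, or newer than b.
func compare(a, b version) int {
	pairs := [3][2]int{{a.major, b.major}, {a.minor, b.minor}, {a.patch, b.patch}}
	for _, p := range pairs {
		if p[0] < p[1] {
			return -1
		}
		if p[0] > p[1] {
			return 1
		}
	}
	return 0
}

// validateUpgrade rejects downgrades and no-op upgrades.
func validateUpgrade(current, target string) error {
	cur, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	switch compare(tgt, cur) {
	case -1:
		return fmt.Errorf("downgrade from %s to %s is not supported", current, target)
	case 0:
		return fmt.Errorf("already at version %s", target)
	}
	return nil
}

func main() {
	fmt.Println(validateUpgrade("v0.42.0", "v0.44.0"))
	fmt.Println(validateUpgrade("v0.44.0", "v0.42.0"))
}
```

Components are compared numerically, so `v0.10.0` is treated as newer than `v0.9.0`, which a plain string comparison would get wrong. Tracking compatibility between versions, as asked above, would need data beyond what this check uses.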

willtsai commented Apr 7, 2025

Would upgrades via GitOps trigger this upgrade flow? I think it should :)

ytimocin commented Apr 7, 2025

> Would upgrades via GitOps trigger this upgrade flow? I think it should :)

I could use @willdavsmith's help on this question :)


Checks will include:

1. Version compatibility verification

Wouldn't we also want to validate permissions, resources, etc. (i.e., that all your assumptions are true)?

2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts
3. **Automatic Rollback**: Failed upgrades trigger automatic restoration of previous state
4. **Detailed Error Reporting**: Clear error messages with troubleshooting guidance
5. **Idempotent Operations**: Commands can be safely retried after addressing issues

Does this mean we'll roll back on every failure, or that the system can be left with varying degrees of the upgrade completed?

The upgrade process will implement the following error handling strategies:

1. **Pre-flight Validation**: Catch incompatibility issues before starting the upgrade
2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts

How do we deal with timeouts? Always roll back?
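One possible answer to the timeout question is sketched in Go below: the upgrade runs under a deadline, and any failure, including a timeout, triggers the rollback callback. The "always roll back" policy and the names `runWithRollback` and `slowUpgrade` are assumptions for illustration; the design discussion leaves the actual policy open.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithRollback runs upgrade under a deadline. If the upgrade fails or
// times out, it calls rollback. This policy is an assumption, not a decision
// taken in the design document.
func runWithRollback(timeout time.Duration, upgrade func(context.Context) error, rollback func() error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := upgrade(ctx)
	if err == nil {
		return nil
	}
	if rbErr := rollback(); rbErr != nil {
		return fmt.Errorf("upgrade failed (%v) and rollback failed: %w", err, rbErr)
	}
	return fmt.Errorf("upgrade failed, previous state restored: %w", err)
}

// slowUpgrade simulates work that takes d to finish and respects cancellation.
func slowUpgrade(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo runs a one-second upgrade with a 10 ms budget.
func demo() (timedOut bool, rolledBack bool) {
	rollback := func() error { rolledBack = true; return nil }
	err := runWithRollback(10*time.Millisecond, slowUpgrade(time.Second), rollback)
	timedOut = errors.Is(err, context.DeadlineExceeded)
	return timedOut, rolledBack
}

func main() {
	fmt.Println(demo())
}
```

Because the original error is wrapped with `%w`, callers can still distinguish a timeout from other failures with `errors.Is`.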


1. **User permissions**: Users running the upgrade command have enough permissions on both the Kubernetes cluster and the Radius installation.

2. **Resource requirements**: The Kubernetes cluster has sufficient compute resources (CPU, memory) to run both the existing and new version components during the rolling upgrade process.

Contributor

What's the expected user experience during upgrade if resources are constrained?

Contributor Author

Good point. Users may hit mid-upgrade failures with no clear information. To fix this, we can:

  • Update the Assumptions section (in the docs or somewhere else users can see) to call out the minimum requirements for the `rad upgrade kubernetes` process
  • Add a ResourceAvailability pre-flight check that measures the resources in the cluster and aborts the process if there are not enough
  • Improve the CLI UX so that, on failure, it reports that there are not enough resources

We can apply all these steps and I will update the doc accordingly.
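The ResourceAvailability check proposed above could reduce to simple arithmetic, as in this Go sketch. The `resources` type, the function name, and all numbers are hypothetical; a real check would read allocatable and requested values from the Kubernetes API rather than take them as arguments.

```go
package main

import "fmt"

// resources holds CPU in millicores and memory in MiB.
type resources struct {
	cpuMilli  int64
	memoryMiB int64
}

// checkResourceAvailability returns an error when the free capacity cannot
// hold the extra pods that run alongside the old ones during a rolling upgrade.
func checkResourceAvailability(allocatable, inUse, required resources) error {
	freeCPU := allocatable.cpuMilli - inUse.cpuMilli
	freeMem := allocatable.memoryMiB - inUse.memoryMiB
	if freeCPU < required.cpuMilli || freeMem < required.memoryMiB {
		return fmt.Errorf(
			"insufficient cluster resources for rolling upgrade: need %dm CPU and %dMiB memory, have %dm and %dMiB free",
			required.cpuMilli, required.memoryMiB, freeCPU, freeMem)
	}
	return nil
}

func main() {
	allocatable := resources{cpuMilli: 4000, memoryMiB: 8192}
	inUse := resources{cpuMilli: 3500, memoryMiB: 6000}
	required := resources{cpuMilli: 1000, memoryMiB: 1024} // hypothetical requirement
	fmt.Println(checkResourceAvailability(allocatable, inUse, required))
}
```

Returning a descriptive error before any change is made also covers the third bullet: the CLI can print the message as-is.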


Example: If you have CLI v0.42 and Control Plane v0.42:

- You CAN upgrade to Control Plane v0.44 using the v0.42 CLI (if v0.42 CLI includes the upgrade feature)

Contributor

Are there any edge cases that we should think about with this approach? A lot of the operations are currently performed on the client side, so I wonder if this would break in ways we haven't thought through.

**Exceptions:**

1. If version check fails because the target is lower than current version
2. If components fail health checks after upgrade

Contributor

How does a user recover from this?

Contributor Author

Good question. If the upgrade has completed successfully and components start misbehaving after `rad upgrade kubernetes` finishes, users can recover with the new command we will introduce, `rad rollback kubernetes`, which will roll back to the most recent successful version.


**Exceptions:**

1. If custom configuration parameters are invalid (or should it be ignored?)

Contributor

What's the benefit of ignoring invalid config?

Contributor Author

I had added that as a note to myself. After playing around with the code, I think we shouldn't ignore invalid configuration; we should just error out.


**Exceptions:**

1. If the user data backup restoration fails (rare but possible)

Contributor

What's the path to recovery from this state?

Contributor Author

There are a couple of possible ways to recover from this state:

  1. Manual rollback
  2. Troubleshoot and retry

Please keep in mind that the user data will still be available in the data store. The failure here is in restoring the backup; the actual data is not affected.

**Exceptions:**

1. If direct upgrade path isn't supported between versions
1. If database migrations encounter issues (this may be the case when we introduce Postgres as the data store)

Contributor

Why is this specific to postgres?

Contributor Author

What I am talking about here is that we will have a migrations folder that holds all of our migrations (up and down files), for example: `add_resources_table.up`, `add_resources_table.down`, `add_new_column_to_resources_table.up`, `add_new_column_to_resources_table.down`, etc. We will need to find a way to do this in etcd.

Comment on lines 288 to 289
CLI -->|Creates User Data Backup| Backup["User Data Backup"]
Backup -->|Restores on Failure| Restore["Restore Mechanism"]

Contributor

These should happen on the server side, client doesn't store the data today.

CLI -->|Logs Progress| User["User"]
CLI -->|Performs Pre-flight Checks| PreFlight["Pre-flight Checks"]
PreFlight -->|Validates| KubernetesAPI
CLI -->|Creates User Data Backup| Backup["User Data Backup"]

Contributor

Where is the backup stored?

Contributor Author

Kubernetes objects may be too platform-specific. But for now, I believe we are going to keep the backups in a Kubernetes object such as a PVC (https://kubernetes.io/docs/concepts/storage/persistent-volumes/).


We can utilize Kubernetes Lease objects (coordination.k8s.io/v1) for implementing the distributed locking mechanism (open to discussion and suggestions). Leases are purpose-built for this use case, providing built-in lease duration and automatic expiration capabilities. For more information, see: <https://kubernetes.io/docs/concepts/architecture/leases/>.

Other CLI commands (`rad deploy app.bicep`, `rad delete app my-app` or other data-changing commands) that modify data will check for this lock before proceeding:

Contributor

Locks should be enforced on the server side to prevent race conditions, as client-side locking may not effectively handle concurrent requests where client mutations and upgrades overlap.

Contributor Author

@ytimocin ytimocin Apr 12, 2025

The lock is going to be introduced in the data store layer.
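To show why enforcing the lock in the data store layer avoids the race raised above, here is a minimal Go sketch. The `lockedStore` type and its methods are hypothetical and in-memory. The point is only that the lock check and the write happen under the same mutex, so a client cannot slip a mutation in between them.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errUpgradeInProgress = errors.New("upgrade in progress: writes are temporarily rejected")

// lockedStore wraps a key-value store and rejects writes while an upgrade
// lock is held. The check lives next to the data, not in the CLI.
type lockedStore struct {
	mu        sync.Mutex
	upgrading bool
	data      map[string]string
}

func newLockedStore() *lockedStore {
	return &lockedStore{data: map[string]string{}}
}

// AcquireUpgradeLock marks the store as upgrading; only one holder is allowed.
func (s *lockedStore) AcquireUpgradeLock() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.upgrading {
		return errors.New("upgrade lock already held")
	}
	s.upgrading = true
	return nil
}

// ReleaseUpgradeLock allows writes again.
func (s *lockedStore) ReleaseUpgradeLock() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.upgrading = false
}

// Save checks the lock and writes under the same mutex, so the check cannot
// go stale before the write happens.
func (s *lockedStore) Save(key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.upgrading {
		return errUpgradeInProgress
	}
	s.data[key] = value
	return nil
}

// demo reports whether a write is rejected during the upgrade and accepted after.
func demo() (rejectedDuring bool, acceptedAfter bool) {
	store := newLockedStore()
	_ = store.AcquireUpgradeLock()
	rejectedDuring = errors.Is(store.Save("app/frontend", "v2"), errUpgradeInProgress)
	store.ReleaseUpgradeLock()
	acceptedAfter = store.Save("app/frontend", "v2") == nil
	return rejectedDuring, acceptedAfter
}

func main() {
	fmt.Println(demo())
}
```

With a client-side check, two separate calls ("is it locked?" then "write") leave a window in which an upgrade can start; the single critical section here closes that window.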

@ytimocin ytimocin force-pushed the ytimocin/design/upgrades branch 3 times, most recently from 312b31d to 9cd00ec Compare April 17, 2025 21:09

### Non goals

- **Downgrade support**: The upgrade process is designed to move forward to newer versions only. Downgrading to previous versions is not supported.

Contributor

One of the use cases I've heard is that an upgrade may succeed from a Radius control plane perspective but then result in test failures in other parts of the stack (e.g., if Radius is built into a developer platform that consists of many components downstream from Radius). How will we address this or advise users?

Contributor Author

In one of the future iterations, we will add a new command to roll back to the most recent successful version of Radius, something like `rad rollback kubernetes`. Would that answer your question?

Comment on lines +40 to +41
- An air-gapped environment is one where systems are physically isolated from unsecured networks like the public internet.
- These environments are common in high-security scenarios (military, financial, healthcare, government) where external network connectivity is restricted.

Contributor

What would be our stance on upgrading Radius in an air-gapped environment? Will there be an option to point a Radius upgrade at container images from pre-downloaded or custom registries?

Contributor Author

Yes, exactly. We will be responsible for adding all the necessary parameters to the necessary commands so that users can point to their own registries and/or specific images.

Contributor Author

Like the work we have been doing here: radius-project/radius#9189. And this is the shell script to run `rad install` without internet access: https://gist.github.com/ytimocin/8887d95ab1409562f4646fd30edb101c.


### Goals

- **Simplify upgrade process**: Provide a single CLI command (`rad upgrade kubernetes`) to upgrade Radius without manual reinstallation.

Contributor

Will there be an option to upgrade via GitOps if the user manages their Radius installation via Flux?

Contributor Author

To be honest, I haven't explored this option. If GitOps uses Helm upgrade behind the scenes, it may not be that difficult to add. But I would discuss this with @willdavsmith and decide what to do next.

Member

@brooke-hamilton brooke-hamilton left a comment

🚀 Nice document 🚀


### High-Level Design Diagram

```mermaid

Member

This diagram appears to indicate that the user data backup runs in parallel to preflight checks but does not have to complete before the helm upgrade is applied. Is that the intended flow?

Contributor Author

No, that is not right. I will update it.


### Architecture Diagram

```mermaid

Member

Could this and the previous diagram be merged into one component diagram that shows relationships and responsibilities?

Contributor Author

I will try to do that too. I am a beginner in using MermaidJS.


### Detailed Design

```mermaid

Member

This is great - very clear process.

}
```

This interface will be implemented (or existing will be improved) to handle version comparisons, prevent downgrades, and resolve the "latest" version tag to a specific version number. The implementation will (probably) connect to the GitHub API to fetch available release versions when needed.

Member

We need a way to check the version in an air-gapped environment without the GitHub API. Maybe use the currently configured OCI container repo?

Contributor Author

@ytimocin ytimocin Apr 17, 2025

We can list the Helm releases using HelmClient in Go and get the deployed release version from that.

```go
type UserDataBackup interface {
// Creates a backup of all user application metadata and configurations
BackupUserData(ctx context.Context) (BackupID string, err error)

Contributor

@lakshmimsft lakshmimsft Apr 17, 2025

Will there be a pattern/format for this ID? Does it tie to a specific backup run, and do we need to identify it later?


Example: If you have CLI v0.42 and Control Plane v0.42:

- You CAN upgrade to Control Plane v0.44 using the v0.42 CLI (if v0.42 CLI includes the upgrade feature)

Contributor

Can we update the CLI as part of the process?

3. **Lock Mechanism**: Data-store-level distributed locking system
4. **Backup/Restore**: User data protection system using ConfigMaps/PVs
5. **Helm Integration**: Enhanced wrapper around Helm's upgrade capabilities
6. **Health Verification**: Component readiness and health check mechanisms

Contributor

Are we thinking beyond health checks for validating that the upgrade was a success? How are we thinking about validating data after schema changes?

Initiating Radius upgrade from v0.40.0 to v0.44.0 (latest stable)...
Pre-flight checks:
✓ Valid version target
✓ Multiple version jump detected (v0.40.0 → v0.44.0)

Contributor

Would the simplest option here be applying each release in turn?


2. **Migration Plan Bundles**

- Generate a composite plan when skipping (e.g. v0.42 → v0.45):

Contributor

Similar to my comment above, does it make sense to apply each of the versions in turn until the requested version is reached?
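The "apply each version in turn" idea from these two comments can be sketched in Go as a plan computed from an ordered release list. The release list and the `planSteps` name are assumptions for illustration; where the list would come from (GitHub API, OCI registry, Helm) is left open in the discussion.

```go
package main

import "fmt"

// planSteps returns the releases to apply, in order, to move from current to
// target. releases must be sorted from oldest to newest.
func planSteps(releases []string, current, target string) ([]string, error) {
	from, to := -1, -1
	for i, r := range releases {
		if r == current {
			from = i
		}
		if r == target {
			to = i
		}
	}
	switch {
	case from < 0:
		return nil, fmt.Errorf("current version %s is not a known release", current)
	case to < 0:
		return nil, fmt.Errorf("target version %s is not a known release", target)
	case to <= from:
		return nil, fmt.Errorf("target %s is not newer than current %s", target, current)
	}
	return releases[from+1 : to+1], nil
}

func main() {
	releases := []string{"v0.42", "v0.43", "v0.44", "v0.45"} // hypothetical list
	steps, err := planSteps(releases, "v0.42", "v0.45")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, s := range steps {
		fmt.Printf("step %d: upgrade to %s\n", i+1, s)
	}
}
```

Printing the returned slice without executing it is also roughly what a `--dry-run` mode would show, with the caveat noted in this review that it models configuration steps rather than data changes.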

3. **User Confirmation & Dry-Run**

- Prompt the user with a clear “You're jumping from A→D. We'll run migrations for B and C in turn. Proceed?”
- Offer a `--dry-run` mode that prints the full step list without making changes.

Contributor

Note: this will only be able to model the configuration changes and not necessarily data changes

**Stale-lock detection:** each lock has a TTL/heartbeat; expired leases are auto-cleaned before AcquireLock.
**Force cleanup:** the `--force` flag allows manual removal of stale/orphaned locks.

Usage in CLI commands:

Contributor

#87 (comment) would still be an issue if we are checking the lock in the CLI.
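The stale-lock behavior quoted above (TTL, heartbeat, cleanup before AcquireLock, forced removal) can be sketched in Go as follows. This is a single-process illustration with an injectable clock. The `leaseLock` type and its methods are hypothetical and are not the Kubernetes Lease API. It is not safe for concurrent use and does not address where the check runs, which is the concern raised in this comment.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type lease struct {
	holder    string
	renewedAt time.Time
	ttl       time.Duration
}

// leaseLock holds at most one lease. now is injectable so tests can move time.
type leaseLock struct {
	current *lease
	now     func() time.Time
}

// expired reports whether the current lease has outlived its TTL.
func (l *leaseLock) expired() bool {
	return l.current != nil && l.now().Sub(l.current.renewedAt) > l.current.ttl
}

// AcquireLock cleans up an expired lease first, then takes the lock if free.
func (l *leaseLock) AcquireLock(holder string, ttl time.Duration) error {
	if l.expired() {
		l.current = nil
	}
	if l.current != nil {
		return fmt.Errorf("lock held by %s", l.current.holder)
	}
	l.current = &lease{holder: holder, renewedAt: l.now(), ttl: ttl}
	return nil
}

// Heartbeat renews the lease so it does not go stale during a long upgrade.
func (l *leaseLock) Heartbeat(holder string) error {
	if l.current == nil || l.current.holder != holder {
		return errors.New("lease not held by " + holder)
	}
	l.current.renewedAt = l.now()
	return nil
}

// ForceRelease removes the lock regardless of holder, like a --force flag.
func (l *leaseLock) ForceRelease() { l.current = nil }

// demo records whether each acquisition attempt succeeded.
func demo() []bool {
	clock := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	lock := &leaseLock{now: func() time.Time { return clock }}
	var results []bool
	results = append(results, lock.AcquireLock("upgrade-1", time.Minute) == nil)
	results = append(results, lock.AcquireLock("deploy-1", time.Minute) == nil)
	clock = clock.Add(2 * time.Minute) // no heartbeat: the lease goes stale
	results = append(results, lock.AcquireLock("deploy-1", time.Minute) == nil)
	lock.ForceRelease()
	results = append(results, lock.AcquireLock("upgrade-2", time.Minute) == nil)
	return results
}

func main() {
	fmt.Println(demo())
}
```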


**Pre-flight Check System:**

Pre-flight checks run before any changes are made to ensure the upgrade can proceed safely.

Contributor

Where will these checks run?
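Wherever the checks end up running, the "before any changes are made" guarantee quoted above suggests a runner that stops at the first failure. A minimal Go sketch follows, with hypothetical types and stubbed checks named after the list in the design document:

```go
package main

import (
	"errors"
	"fmt"
)

// preflightCheck pairs a display name with the function that performs it.
type preflightCheck struct {
	name string
	run  func() error
}

// runPreflight runs checks in order and stops at the first failure. It
// returns the names of the checks that passed before that point.
func runPreflight(checks []preflightCheck) (passed []string, err error) {
	for _, c := range checks {
		if cerr := c.run(); cerr != nil {
			return passed, fmt.Errorf("pre-flight check %q failed: %w", c.name, cerr)
		}
		passed = append(passed, c.name)
	}
	return passed, nil
}

// demo uses stub checks; the third one simulates an unreachable data store.
func demo() ([]string, error) {
	ok := func() error { return nil }
	checks := []preflightCheck{
		{"Version compatibility verification", ok},
		{"Existing installation detection", ok},
		{"Database connectivity", func() error { return errors.New("data store unreachable") }},
		{"Custom configuration validation", ok},
	}
	return runPreflight(checks)
}

func main() {
	passed, err := demo()
	for _, name := range passed {
		fmt.Println("✓", name)
	}
	if err != nil {
		fmt.Println("✗", err)
	}
}
```

Since the runner only takes functions, the same list could run in the CLI or on the server side, which keeps the question in this comment open.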


1. Version compatibility verification
2. Existing installation detection
3. Database connectivity

Contributor

What do we mean by this?

1. Version compatibility verification
2. Existing installation detection
3. Database connectivity
4. Custom configuration validation

Contributor

Can you expand a bit on what is custom configuration?

Rather than taking complete snapshots of the underlying databases (etcd/PostgreSQL), we'll implement a more targeted approach that backs up only the user application metadata and configuration that Radius manages:

- **Included in backup**: User application, environment, recipe definitions, and all other resources that the user has deployed/added via Radius.
- **Not included in backup**: Anything other than user data in the data store.

Contributor

Can you share example of this data that's not user data?

Values map[string]interface{} // Custom configuration values
Timeout time.Duration // Maximum time allowed for upgrade

EnableUserDataBackup bool // Whether automatic user data backup is enabled

Contributor

Could you share a scenario when this will be disabled?


### API design (if applicable)

No specific REST API addition is necessary.

Contributor

UpgradeRadius(ctx context.Context, options UpgradeOptions) error
// Returns the current status of an ongoing upgrade
GetUpgradeStatus(ctx context.Context) (UpgradeStatus, error)
// Validates that an upgrade to the target version is possible
ValidateUpgrade(ctx context.Context, targetVersion string) error

I gathered that these will be new APIs. Did I misunderstand?

@superbeeny
Contributor

Nice, very comprehensive and lots of food for thought

ytimocin added 5 commits May 12, 2025 14:21
Signed-off-by: ytimocin <ytimocin@microsoft.com>
Signed-off-by: ytimocin <ytimocin@microsoft.com>
Signed-off-by: ytimocin <ytimocin@microsoft.com>
Signed-off-by: ytimocin <ytimocin@microsoft.com>
Signed-off-by: ytimocin <ytimocin@microsoft.com>
@ytimocin ytimocin force-pushed the ytimocin/design/upgrades branch from bd857f2 to 748836b Compare May 12, 2025 21:21
Signed-off-by: ytimocin <ytimocin@microsoft.com>
9 participants