Radius in-place upgrade design document #87
- **Simplify upgrade process**: Provide a single CLI command (`rad upgrade kubernetes`) to upgrade Radius without manual reinstallation.
- **Ensure data safety**: Implement automatic user data backups before upgrades and restore capability if failures occur.
- **Minimize downtime**: Use rolling upgrades where possible to keep Radius control plane available during the upgrade process.
We can use the rolling upgrade feature of Kubernetes to revert the upgrade process if there is an issue with the new pods, because the older ones will not be deleted until the new ones are healthy.
Wouldn't this require us to have the ability to roll back the database state changes? Otherwise new pods wouldn't be healthy until the change is made, and rolling back pods wouldn't work until the change is reverted.
### Non goals

- **Downgrade support**: The upgrade process is designed to move forward to newer versions only. Downgrading to previous versions is not supported.
If an upgrade is "bad", we need an option to roll back.
Do you mean a bad upgrade but healthy pods?
I don't agree this is a non-goal. We need the ability to roll back even if the upgrade succeeds. If we introduce a regression, users must be able to roll back to a last known good state without waiting for us to fix the issue and release a new build.
Yes, I added that later in the document. I will remove this non-goal. As discussed with you yesterday, we will introduce a `rad rollback kubernetes` command in one of the future iterations of this work.
2. **Resource requirements**: The Kubernetes cluster has sufficient compute resources (CPU, memory) to run both the existing and new version components during the rolling upgrade process.

3. **CLI version compatibility**: User has a Radius CLI version that includes the `rad upgrade kubernetes` feature. While older CLIs can't perform upgrades, newer CLIs maintain backward compatibility with older control planes.
Can we discuss further if/how we are proposing backward-compatibility?
In one of the future versions, we will keep a list of the most recent successful Radius installations and introduce a new command to roll back to the most recent known-good version.
This can get really thorny if the DB schema changes and data has to be rolled back as well.
- Define two new interfaces in the `components/database` package:
  - `UserDataBackup`: Responsible for creating backups of user data before the upgrade.
  - `UserDataRestore`: Responsible for restoring data from the backup in case of rollback.
- Design versioned backup formats to handle schema migrations between versions.
etcd has snapshots with a lock, but doesn't have a concept of a transaction lock where we could do a snapshot and perform the upgrade in the same lock so we would want to block writes at a higher level. In Postgres we can do a transaction with rollback preventing the need for a backup at all, or enable an optional backup but it seems redundant.
I was also considering this option of not adding backup functionality and instead solving the problem in the data layer. We can discuss further.
1. If custom configuration parameters are invalid (or should it be ignored?)

#### Scenario 3: Handling upgrade failure and recovery
How do we handle rollback after a schema change?
How we do a rollback depends on how we handle the schema changes. For example, some DB migration tools require two files per migration: `up.sql` and `down.sql`. We would run `down.sql` to do the rollback.
There's a distinct difference between a rollback and down migration. Rollback is automatic when the database change fails, down migration is something a user does to undo a database change.
This should be easy to do for etcd:
https://etcd.io/docs/v3.5/op-guide/recovery/
We just need to take a snapshot and perhaps label it with the old version. Postgres would be similar. We should document them here, as this is distinctly different from a rollback migration. There are some considerations: the snapshot will reflect the app graph at the time of the last deployment, so we may want to document that a deployment after rollback is required.
- **Version skipping limits**: Should we enforce incremental upgrades for certain major version jumps, or always allow direct version skipping?
- **Upgrade notifications**: How should we notify users clearly about the CLI version mismatch after upgrading the control plane?
- **Resource constraints**: How do we handle scenarios where the cluster lacks sufficient resources to perform a rolling upgrade?
One thing that I'm not seeing mentioned is how we handle breaking changes. We're still in 0.x versions, and sometimes breaking changes might happen in the resource specs or in other places.
There are two kinds of breaking changes, and we can choose not to introduce destructive breaking changes. Changing a column/field from a string to an integer-backed enum is a destructive change. Using a new column/field and deprecating use of the old one is non-destructive but "breaking" in that reverting a version of the running control plane will not look at the correct info.
}
```

This interface will be implemented (or the existing one improved) to handle version comparisons, prevent downgrades, and resolve the "latest" version tag to a specific version number. The implementation will likely connect to the GitHub API to fetch available release versions when needed.
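To make the comparison semantics concrete, here is a minimal sketch of version comparison and downgrade prevention. This is illustrative only: `compareVersions` and `validateUpgrade` are hypothetical helper names, not the actual Radius interface, and real code would also handle pre-release tags and the "latest" resolution described above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits a "vMAJOR.MINOR.PATCH" string into its numeric parts.
func parse(v string) [3]int {
	var out [3]int
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	for i := 0; i < len(parts) && i < 3; i++ {
		n, _ := strconv.Atoi(parts[i])
		out[i] = n
	}
	return out
}

// compareVersions returns -1 if a < b, 0 if equal, 1 if a > b.
func compareVersions(a, b string) int {
	pa, pb := parse(a), parse(b)
	for i := 0; i < 3; i++ {
		if pa[i] < pb[i] {
			return -1
		}
		if pa[i] > pb[i] {
			return 1
		}
	}
	return 0
}

// validateUpgrade rejects downgrades and no-op "upgrades" to the same version.
func validateUpgrade(current, target string) error {
	switch compareVersions(target, current) {
	case -1:
		return fmt.Errorf("downgrade from %s to %s is not supported", current, target)
	case 0:
		return fmt.Errorf("already running %s", current)
	}
	return nil
}

func main() {
	fmt.Println(validateUpgrade("v0.42.0", "v0.44.0")) // allowed: prints <nil>
	fmt.Println(validateUpgrade("v0.44.0", "v0.42.0")) // rejected: prints the downgrade error
}
```

The downgrade check is what backs the "forward only" non-goal discussed above; a future `rad rollback kubernetes` would bypass it deliberately rather than relax it.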
How will we track information about "compatibility" between versions? Also, we need to support rollback even if the upgrade completed successfully.
Added that in the development plan.
Would upgrades via GitOps trigger this upgrade flow? I think it should :)
I can use @willdavsmith's help on this question :)
Checks will include:

1. Version compatibility verification
Wouldn't we also want to validate permissions, resources, etc. (i.e. that all your assumptions are true)?
2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts
3. **Automatic Rollback**: Failed upgrades trigger automatic restoration of previous state
4. **Detailed Error Reporting**: Clear error messages with troubleshooting guidance
5. **Idempotent Operations**: Commands can be safely retried after addressing issues
Does this mean we'll roll back on every failure, or that the system can have varying degrees of upgrade completed?
The upgrade process will implement the following error handling strategies:

1. **Pre-flight Validation**: Catch incompatibility issues before starting the upgrade
2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts
How do we deal with timeouts? Always roll back?
1. **User permissions**: Users running the upgrade command have enough permissions on both the Kubernetes cluster and the Radius installation.

2. **Resource requirements**: The Kubernetes cluster has sufficient compute resources (CPU, memory) to run both the existing and new version components during the rolling upgrade process.
What's the expected user experience during upgrade if resources are constrained?
Good point. Users may hit mid-upgrade failures with no clear information. To fix this, we can:

- Update the Assumptions section (in docs or somewhere the users can see) to call out the minimum requirements for the `rad upgrade kubernetes` process
- Add a ResourceAvailability pre-flight check that measures the resources in the cluster and aborts the process if there are not enough
- Improve the CLI UX so that on failure it mentions that there are not enough resources

We can apply all these steps and I will update the doc accordingly.
Example: If you have CLI v0.42 and Control Plane v0.42:

- You CAN upgrade to Control Plane v0.44 using the v0.42 CLI (if v0.42 CLI includes the upgrade feature)
Are there any edge cases that we should think about with this approach? A lot of the operations are currently performed on the client side, so I wonder if this would break in ways we haven't thought through.
**Exceptions:**

1. If version check fails because the target is lower than current version
2. If components fail health checks after upgrade
How does a user recover from this?
Good question. If the upgrade has successfully completed and components start acting weird after `rad upgrade kubernetes` finishes, users can recover via the new command we will introduce: `rad rollback kubernetes`, which will roll back to the most recent successful version.
**Exceptions:**

1. If custom configuration parameters are invalid (or should it be ignored?)
What's the benefit of ignoring invalid config?
I had added that as a note to myself. After playing around with the code, we shouldn't ignore it and should just error out.
**Exceptions:**

1. If the user data backup restoration fails (rare but possible)
What's the path to recovery from this state?
There may be 3 possible ways to recover from this state:

- Manual rollback
- Troubleshoot and retry

Let's please keep in mind that the user data will still be available in the data store. This is just the backup restoration failure we are talking about; it is not the actual data.
**Exceptions:**

1. If direct upgrade path isn't supported between versions
2. If database migrations encounter issues (this may be the case when we introduce Postgres as the data store)
Why is this specific to Postgres?
What I am talking about here is that we will have a migrations folder that will have all of our migrations (up and down files). For example: add_resources_table.up, add_resources_table.down, add_new_column_to_resources_table.up, and add_new_column_to_resources_table.down etc. We will need to find a way to do this in etcd.
CLI -->|Creates User Data Backup| Backup["User Data Backup"]
Backup -->|Restores on Failure| Restore["Restore Mechanism"]
These should happen on the server side, client doesn't store the data today.
CLI -->|Logs Progress| User["User"]
CLI -->|Performs Pre-flight Checks| PreFlight["Pre-flight Checks"]
PreFlight -->|Validates| KubernetesAPI
CLI -->|Creates User Data Backup| Backup["User Data Backup"]
Where is the backup stored?
Kubernetes objects may be too platform specific. But for now, I believe we are going to keep them in a Kubernetes object like a PVC (https://kubernetes.io/docs/concepts/storage/persistent-volumes/).
We can utilize Kubernetes Lease objects (`coordination.k8s.io/v1`) for implementing the distributed locking mechanism (open to discussion and suggestions). Leases are purpose-built for this use case, providing built-in lease duration and automatic expiration capabilities. For more information, see: <https://kubernetes.io/docs/concepts/architecture/leases/>.

Other CLI commands (`rad deploy app.bicep`, `rad delete app my-app`, or other data-changing commands) that modify data will check for this lock before proceeding:
Locks should be enforced on the server side to prevent race conditions, as client-side locking may not effectively handle concurrent requests where client mutations and upgrades overlap.
It is going to be introduced in the data store layer.
### Non goals

- **Downgrade support**: The upgrade process is designed to move forward to newer versions only. Downgrading to previous versions is not supported.
One of the use cases I've heard is that an upgrade may go successfully from a Radius control plane perspective, but then subsequently result in test failures in other parts of the stack (e.g. if Radius is built into a developer platform that consists of many components downstream from Radius). How will we address or advise users for this?
In one of the future iterations, we will add a new command to roll back to the most recent successful version of Radius, something like `rad rollback kubernetes`. Would that answer your question?
- An air-gapped environment is one where systems are physically isolated from unsecured networks like the public internet.
- These environments are common in high-security scenarios (military, financial, healthcare, government) where external network connectivity is restricted.
What would be our stance for upgrading Radius in an air-gapped environment? Will there be an option to point a Radius upgrade to use container images from predownloaded or custom registries?
Yes, exactly. We will be responsible for adding all the necessary parameters to the necessary commands so that users can point to their own registries and/or specific images.
Like the work we have been doing here: radius-project/radius#9189. And this is the sh file to run `rad install` w/o internet access: https://gist.github.com/ytimocin/8887d95ab1409562f4646fd30edb101c.
### Goals

- **Simplify upgrade process**: Provide a single CLI command (`rad upgrade kubernetes`) to upgrade Radius without manual reinstallation.
Will there be an option to upgrade via GitOps if the user manages their Radius installation via Flux?
To be honest, I haven't explored this option. If GitOps is using Helm upgrade behind the scenes, it may not be that difficult to add. But I would discuss this with @willdavsmith and decide what to do next.
🚀 Nice document 🚀
### High-Level Design Diagram

```mermaid
This diagram appears to indicate that the user data backup runs in parallel to preflight checks but does not have to complete before the helm upgrade is applied. Is that the intended flow?
No, that is not right. I will update it.
### Architecture Diagram

```mermaid
Could this and the previous diagram be merged into one component diagram that shows relationships and responsibilities?
I will try to do that too. I am a beginner in using MermaidJS.
### Detailed Design

```mermaid
This is great - very clear process.
}
```

This interface will be implemented (or the existing one improved) to handle version comparisons, prevent downgrades, and resolve the "latest" version tag to a specific version number. The implementation will likely connect to the GitHub API to fetch available release versions when needed.
We need a way to check the version in an airgapped environment without the GitHub API. Maybe use the currently configured OCI container repo?
We can list the Helm releases using the Helm client in Go and get the deployed release version from that.
```go
type UserDataBackup interface {
	// Creates a backup of all user application metadata and configurations
	BackupUserData(ctx context.Context) (BackupID string, err error)
Will there be a pattern/format for this ID? Does it tie to a specific backup run, and do we need to identify it later?
Example: If you have CLI v0.42 and Control Plane v0.42:

- You CAN upgrade to Control Plane v0.44 using the v0.42 CLI (if v0.42 CLI includes the upgrade feature)
Can we update the CLI as part of the process?
3. **Lock Mechanism**: Data-store-level distributed locking system
4. **Backup/Restore**: User data protection system using ConfigMaps/PVs
5. **Helm Integration**: Enhanced wrapper around Helm's upgrade capabilities
6. **Health Verification**: Component readiness and health check mechanisms
Are we thinking beyond health checks for validation that the upgrade was a success? How are we thinking about validating data post schema changes?
Initiating Radius upgrade from v0.40.0 to v0.44.0 (latest stable)...
Pre-flight checks:
✓ Valid version target
✓ Multiple version jump detected (v0.40.0 → v0.44.0)
Would the simplest option here be applying each release in turn?
2. **Migration Plan Bundles**

   - Generate a composite plan when skipping (e.g. v0.42 → v0.45):
Similar to my comment above, does it make sense to apply each of the versions in turn until the requested version is reached?
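The "apply each release in turn" idea can be sketched as a small path planner. This is illustrative only, under a simplifying assumption (releases are `v0.<minor>.0` and the minor version increments by one per release); `planUpgradePath` is a made-up name, not the Radius API.

```go
package main

import "fmt"

// planUpgradePath enumerates each release between the current and
// target versions, so migrations can be applied one release at a time
// instead of in a single composite jump.
func planUpgradePath(currentMinor, targetMinor int) ([]string, error) {
	if targetMinor <= currentMinor {
		return nil, fmt.Errorf("target v0.%d must be newer than current v0.%d", targetMinor, currentMinor)
	}
	steps := make([]string, 0, targetMinor-currentMinor)
	for m := currentMinor + 1; m <= targetMinor; m++ {
		steps = append(steps, fmt.Sprintf("v0.%d.0", m))
	}
	return steps, nil
}

func main() {
	// Dry-run style output for a v0.42 -> v0.45 jump.
	steps, _ := planUpgradePath(42, 45)
	for i, s := range steps {
		fmt.Printf("step %d: upgrade to %s and run its migrations\n", i+1, s)
	}
}
```

Printing this list before confirmation is essentially the `--dry-run` behavior proposed in the next item, though as noted below it can only model configuration steps, not the data changes themselves.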
3. **User Confirmation & Dry-Run**

   - Prompt the user with a clear "You're jumping from A→D. We'll run migrations for B and C in turn. Proceed?"
   - Offer a `--dry-run` mode that prints the full step list without making changes.
Note: this will only be able to model the configuration changes and not necessarily data changes
**Stale-lock detection:** each lock has a TTL/heartbeat; expired leases are auto-cleaned before `AcquireLock`.
**Force cleanup:** `--force` flag allows manual removal of stale/orphaned locks.

Usage in CLI commands:
#87 (comment) would still be an issue if we are checking the lock in the CLI.
**Pre-flight Check System:**

Pre-flight checks run before any changes are made to ensure the upgrade can proceed safely.
Where will these checks run?
1. Version compatibility verification
2. Existing installation detection
3. Database connectivity
What do we mean by this?
1. Version compatibility verification
2. Existing installation detection
3. Database connectivity
4. Custom configuration validation
Can you expand a bit on what is custom configuration?
Rather than taking complete snapshots of the underlying databases (etcd/PostgreSQL), we'll implement a more targeted approach that backs up only the user application metadata and configuration that Radius manages:

- **Included in backup**: User application, environment, recipe definitions, and all other resources that the user has deployed/added via Radius.
- **Not included in backup**: Anything other than user data in the data store.
Can you share an example of this data that's not user data?
Values map[string]interface{} // Custom configuration values
Timeout time.Duration // Maximum time allowed for upgrade

EnableUserDataBackup bool // Whether automatic user data backup is enabled
Could you share a scenario when this will be disabled?
### API design (if applicable)

No specific REST API addition is necessary.
design-notes/architecture/2025-03-upgrade-design-doc.md, lines 462 to 468 in 9cd00ec:

UpgradeRadius(ctx context.Context, options UpgradeOptions) error
// Returns the current status of an ongoing upgrade
GetUpgradeStatus(ctx context.Context) (UpgradeStatus, error)
// Validates that an upgrade to the target version is possible
ValidateUpgrade(ctx context.Context, targetVersion string) error

I gathered that these will be new APIs, did I misunderstand it?
Nice, very comprehensive and lots of food for thought
Signed-off-by: ytimocin <ytimocin@microsoft.com>