Commit bd857f2

Address feedback
Signed-off-by: ytimocin <ytimocin@microsoft.com>
1 parent 9cd00ec commit bd857f2

1 file changed: +99 -78 lines changed

architecture/2025-03-upgrade-design-doc.md

Lines changed: 99 additions & 78 deletions
Original file line number | Diff line number | Diff line change
@@ -97,8 +97,6 @@ Initiating Radius upgrade from v0.44.0 to v0.45.0...
9797
Pre-flight checks:
9898
✓ Valid version target
9999
✓ Compatible upgrade path
100-
Creating backup of current user data...
101-
✓ Backup created successfully
102100
Upgrading control plane components:
103101
✓ Universal Control Plane
104102
✓ Deployment Engine
@@ -115,10 +113,9 @@ Note: Your local Radius CLI is still v0.44.0. To upgrade your CLI, download the
115113
**Result:**
116114

117115
1. Pre-flight checks validate the upgrade is possible
118-
1. System automatically creates a user data backup for recovery
119-
1. All control plane components are upgraded in sequence
120-
1. Post-upgrade verification confirms system health
121-
1. User is notified about the CLI version mismatch
116+
2. All control plane components are upgraded in sequence
117+
3. Post-upgrade verification confirms system health
118+
4. User is notified about the CLI version mismatch
122119

123120
**Exceptions:**
124121

@@ -143,8 +140,6 @@ Pre-flight checks:
143140
✓ Valid version target
144141
✓ Compatible upgrade path
145142
✓ Custom configuration validated
146-
Creating backup of current user data...
147-
✓ Backup created successfully
148143
Upgrading control plane components with custom configuration:
149144
✓ Universal Control Plane
150145
✓ Deployment Engine
@@ -184,16 +179,13 @@ Initiating Radius upgrade from v0.43.0 to v0.44.0...
184179
Pre-flight checks:
185180
✓ Valid version target
186181
✓ Compatible upgrade path
187-
Creating backup of current user data...
188-
✓ Backup created successfully
189182
Upgrading control plane components:
190183
✓ Universal Control Plane
191184
✓ Deployment Engine
192185
✗ Applications Resource Provider (ERROR: Container image pull failed)
193186

194187
ERROR: Upgrade failed during Applications Resource Provider update.
195188
Initiating automatic rollback to v0.43.0...
196-
✓ Restoring from backup (Not sure if this is needed)
197189
✓ Universal Control Plane reverted
198190
✓ Deployment Engine reverted
199191
✓ System verification complete
@@ -206,13 +198,13 @@ Review Kubernetes events and logs for more details on the failure.
206198
**Result:**
207199

208200
1. System detects failure during the upgrade process
209-
1. Automatic restore is initiated using the pre-upgrade backup
210-
1. All components are restored to their previous state
211-
1. User is informed of the failure and suggested next steps
201+
2. Helm-based rollback is initiated to revert Kubernetes resources
202+
3. Control plane components are reverted to their previous version
203+
4. User is informed of the failure and suggested next steps
212204

213205
**Exceptions:**
214206

215-
1. If the user data backup restoration fails (rare but possible)
207+
1. If Helm rollback fails (would require manual intervention)
216208

217209
#### Scenario 4: Upgrading across multiple versions
218210

@@ -235,8 +227,6 @@ Pre-flight checks:
235227
✓ Multiple version jump detected (v0.40.0 → v0.44.0)
236228
✓ Compatible upgrade path confirmed
237229
✓ Database schema changes detected
238-
Creating backup of current user data...
239-
✓ Backup created successfully
240230
Upgrading control plane components:
241231
✓ Universal Control Plane
242232
✓ Deployment Engine
@@ -284,8 +274,6 @@ graph TD
284274
CLI -->|Logs Progress| User["User"]
285275
CLI -->|Performs Pre-flight Checks| PreFlight["Pre-flight Checks"]
286276
PreFlight -->|Validates| KubernetesAPI
287-
CLI -->|Creates User Data Backup| Backup["User Data Backup"]
288-
Backup -->|Restores on Failure| Restore["Restore Mechanism"]
289277
```
290278

291279
- **Important Note:** As of April 2025, Postgres is not yet fully implemented as the Radius data store; etcd is used in production.
@@ -298,10 +286,8 @@ graph TD
298286
CLI -->|"Initiates Upgrade"| K8sAPI["Kubernetes API"]
299287
300288
K8sAPI --> PreflightChecks["Preflight Checks"]
301-
PreflightChecks --> UserDataBackup["Backup Creation"]
302-
UserDataBackup --> ComponentUpgrade["Component Upgrade"]
289+
PreflightChecks --> ComponentUpgrade["Component Upgrade"]
303290
ComponentUpgrade --> PostUpgradeVerify["Post-Upgrade Verification"]
304-
ComponentUpgrade -- "Failure" --> UserDataRestore["User Data Restore"]
305291
306292
subgraph "Radius Control Plane"
307293
UCP["Universal Control Plane"]
@@ -331,15 +317,13 @@ graph TD
331317
ParseArgs --> ValidateVersion[Validate version compatibility]
332318
ValidateVersion --> AcquireLock[Acquire upgrade lock]
333319
AcquireLock --> RunPreflights[Run pre-flight checks]
334-
RunPreflights --> UserDataBackup[Create user data backup]
335-
UserDataBackup --> PlanUpgrade[Calculate upgrade plan]
320+
RunPreflights --> PlanUpgrade[Calculate upgrade plan]
336321
PlanUpgrade --> ExecuteHelmUpgrade[Execute Helm chart upgrade]
337322
ExecuteHelmUpgrade --> MonitorProgress[Monitor upgrade progress]
338323
MonitorProgress --> VerifyComponents[Verify component health]
339324
VerifyComponents --> Success{Successful?}
340325
Success -- Yes --> ReleaseLock[Release upgrade lock]
341-
Success -- No --> UserDataRestore[Restore user data]
342-
UserDataRestore --> RollbackHelm[Rollback Helm changes]
326+
Success -- No --> RollbackHelm[Rollback Helm changes]
343327
RollbackHelm --> ReleaseLock
344328
ReleaseLock --> End[Display results to user]
345329
```
@@ -364,21 +348,38 @@ This interface will be implemented (or existing will be improved) to handle vers
364348
- To prevent concurrent data‐modifying operations during `rad upgrade kubernetes`, we’ll rely exclusively on datastore locks (no Kubernetes leases).
365349

366350
```go
367-
// UpgradeLock is implemented per datastore (Postgres, etcd) to serialize upgrades.
351+
// UpgradeLock is implemented per datastore (Postgres, etcd) to serialize upgrades (with enhanced resilience)
368352
type UpgradeLock interface {
369-
// AcquireLock blocks until it obtains an exclusive lock or the context deadline is exceeded.
370-
AcquireLock(ctx context.Context) error
353+
// AcquireLock obtains an exclusive lock with a TTL or fails
354+
AcquireLock(ctx context.Context, ttl time.Duration) error
371355

372-
// ReleaseLock frees the lock immediately so others can proceed.
356+
// ExtendLock refreshes the TTL on an existing lock (heartbeat)
357+
ExtendLock(ctx context.Context, ttl time.Duration) error
358+
359+
// ReleaseLock explicitly releases a lock
373360
ReleaseLock(ctx context.Context) error
374361

375-
// IsUpgradeInProgress returns true if a valid (non‑stale) lock is held by another process.
362+
// IsUpgradeInProgress checks if a valid lock exists
376363
IsUpgradeInProgress(ctx context.Context) (bool, error)
364+
365+
// GetLockInfo returns metadata about the current lock
366+
GetLockInfo(ctx context.Context) (LockInfo, error)
367+
368+
// ForceReleaseLock allows admin override with reason tracking
369+
ForceReleaseLock(ctx context.Context, reason string) error
370+
}
371+
372+
type LockInfo struct {
373+
AcquiredAt time.Time
374+
ExpiresAt time.Time
375+
LockedBy string
376+
LastHeartbeatAt time.Time
377+
IsStale bool
377378
}
378379
```
379380

380381
**Timeouts:** callers must supply a context with a finite deadline (e.g. 2 min) to avoid blocking forever.
381-
**Stalelock detection:** each lock has a TTL/heartbeat; expired leases are autocleaned before AcquireLock.
382+
**Stale-lock detection:** each lock has a TTL/heartbeat; expired locks are auto-cleaned before `AcquireLock`.
382383
**Force cleanup:** the `--force` flag allows manual removal of stale/orphaned locks.
383384

384385
Usage in CLI commands:
@@ -431,7 +432,7 @@ Checks will include:
431432
3. Database connectivity
432433
4. Custom configuration validation
433434

434-
**User Data Backup and Restore System:**
435+
**[Future Version] User Data Backup and Restore System:**
435436

436437
Rather than taking complete snapshots of the underlying databases (etcd/PostgreSQL), we'll implement a more targeted approach that backs up only the user application metadata and configuration that Radius manages:
437438

@@ -473,8 +474,8 @@ type UpgradeOptions struct {
473474
Values map[string]interface{} // Custom configuration values
474475
Timeout time.Duration // Maximum time allowed for upgrade
475476

476-
EnableUserDataBackup bool // Whether automatic user data backup is enabled
477-
BackupID string // ID of user data backup to use for recovery
477+
EnableUserDataBackup bool // Future Version: Whether automatic user data backup is enabled
478+
BackupID string // Future Version: ID of user data backup to use for recovery
478479
}
479480
```
480481

@@ -513,6 +514,7 @@ const (
513514
1. **Flexibility**: Support for custom configuration parameters allows adaptation to different environments
514515
1. **Transparency**: Clear, step-by-step output keeps users informed of the upgrade process
515516
1. **Consistency**: Ensures all Radius components are upgraded together to compatible versions
517+
1. **Safety**: Comprehensive preflight checks prevent upgrades in unsuitable conditions, and Helm-based rollback reverts failed upgrades. User data backup and restore capabilities, planned for a future version, will add protection for user data during upgrades
516518

517519
#### Disadvantages of this approach
518520

@@ -553,28 +555,31 @@ The implementation will primarily focus on the following components:
553555
1. **Upgrade Command**: The `rad upgrade kubernetes` command implementation in the CLI codebase
554556
2. **Version Validation**: Logic to verify compatibility between versions
555557
3. **Lock Mechanism**: Data-store-level distributed locking system
556-
4. **Backup/Restore**: User data protection system using ConfigMaps/PVs
557-
5. **Helm Integration**: Enhanced wrapper around Helm's upgrade capabilities
558-
6. **Health Verification**: Component readiness and health check mechanisms
558+
4. **Preflight Checks**: Validation system to ensure prerequisites are met before upgrade
559+
5. **[Future Version] Backup/Restore**: User data protection system using ConfigMaps/PVs
560+
6. **Helm Integration**: Enhanced wrapper around Helm's upgrade capabilities
561+
7. **Health Verification**: Component readiness and health check mechanisms
559562

560563
All components will follow Radius coding standards and include comprehensive unit tests.
561564

562565
### Error Handling
563566

564567
The upgrade process will implement the following error handling strategies:
565568

566-
1. **Pre-flight Validation**: Catch incompatibility issues before starting the upgrade
567-
2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts
568-
3. **Automatic Rollback**: Failed upgrades trigger automatic restoration of previous state
569-
4. **Detailed Error Reporting**: Clear error messages with troubleshooting guidance
570-
5. **Idempotent Operations**: Commands can be safely retried after addressing issues
571-
6. **Resource Cleanup**: Temporary resources created during the upgrade are properly removed
569+
1. **Pre-flight Validation**: Catch incompatibility issues before starting the upgrade.
570+
2. **Graceful Timeouts**: All operations will respect user-defined or default timeouts.
571+
3. **Helm-based Rollback**: For version 1, failed upgrades will leverage Helm's built-in rollback capability to revert Kubernetes resources to their previous state. Note that this does not include restoration of any user data that might have been modified during the failed upgrade attempt. Full user data backup and restore capabilities will be added in a future version.
572+
4. **Detailed Error Reporting**: Clear error messages with troubleshooting guidance.
573+
5. **Idempotent Operations**: Commands can be safely retried after addressing issues.
574+
6. **Resource Cleanup**: Temporary resources created during the upgrade are properly removed.
572575

573576
## Test Plan
574577

575578
### Unit Tests
576579

577580
- Test each interface implementation independently
581+
- Test each preflight check with various input scenarios (pass/fail/warning)
582+
- Test preflight check registry with multiple checks of different severities
578583

579584
### Integration Tests
580585

@@ -617,54 +622,51 @@ The following outlines the key implementation steps required to deliver the Radi
617622

618623
- Implement the upgrade functionality in the Radius Helm client: [helmclient.go](https://github.com/radius-project/radius/blob/main/pkg/cli/helm/helmclient.go).
619624
- Add unit tests to validate Helm upgrade logic.
620-
- This task can be worked on in parallel with items 2-4. It is a blocker for item 6.
625+
- This task can be worked on in parallel with items 2-5. It is a blocker for item 6.
621626

622627
2. **Radius Contour Client Updates**
623628

624629
- Implement the upgrade functionality in the Radius Contour client: [contourclient.go](https://github.com/radius-project/radius/blob/main/pkg/cli/helm/contourclient.go).
625630
- Add unit tests to verify correct behavior.
626-
- This task can be worked on in parallel with items 1, 3-4. It is a blocker for item 6.
631+
- This task can be worked on in parallel with items 1 and 3-5. It is a blocker for item 6.
627632

628633
3. **Cluster Upgrade Interface**
629634

630635
- Extend the existing cluster management interface ([cluster.go](https://github.com/radius-project/radius/blob/main/pkg/cli/helm/cluster.go#L249)) to include a new method for upgrading Radius.
631636
- Implement this method in all relevant interface implementations.
632637
- Integrate with version validation and custom configuration handling.
633638
- Add comprehensive unit tests for this functionality.
634-
- This task can be worked on in parallel with items 1-2, 4. It is a blocker for item 6.
635-
636-
4. **User Data Backup and Restore Interfaces**
637-
638-
- Define two new interfaces in the `components/database` package:
639-
- `UserDataBackup`: Responsible for creating backups of user data before the upgrade.
640-
- `UserDataRestore`: Responsible for restoring data from the backup in case of rollback.
641-
- Design versioned backup formats to handle schema migrations between versions.
642-
- This task can be worked on in parallel with items 1-3. It's a blocker for item 5.
643-
644-
5. **User Data Backup and Restore Implementation**
645-
646-
- Implement the backup and restore interfaces in the following data store implementations:
647-
- **In-memory datastore**: [inmemory/client.go](https://github.com/radius-project/radius/blob/main/pkg/components/database/inmemory/client.go)
648-
- **Postgres datastore**: [postgresclient.go](https://github.com/radius-project/radius/blob/main/pkg/components/database/postgres/postgresclient.go)
649-
- Add comprehensive unit tests for each implementation.
650-
- Implement backup storage mechanism in Kubernetes (ConfigMaps or PVs depending on size).
651-
- This task depends on item 4 (interfaces) and blocks item 6 (CLI implementation).
639+
- This task can be worked on in parallel with items 1-2, 4-5. It is a blocker for item 6.
652640

653-
6. **Upgrade Lock Mechanism**
641+
4. **Upgrade Lock Mechanism**
654642

655643
- Implement the upgrade lock interface to prevent concurrent modifications.
656644
- Update existing CLI commands to check for locks before data modification.
657-
- Can be implemented in parallel with items 1-5. Required for item 7.
645+
- Can be implemented in parallel with items 1-3, 5. Required for item 6.
658646

659-
7. **CLI Command Implementation**
647+
5. **Preflight Checks Implementation**
648+
649+
- Implement the `PreflightCheck` interface and create concrete check implementations:
650+
- **VersionCompatibilityCheck**: Validates target version is newer than current version
651+
- **ClusterResourceCheck**: Verifies the cluster has sufficient resources (CPU, memory)
652+
- **ControlPlaneHealthCheck**: Confirms current installation is in a healthy state
653+
- **CustomConfigValidationCheck**: Validates any custom configuration parameters
654+
- Create a preflight checks registry to manage and execute checks in sequence
655+
- Implement severity levels (Error, Warning, Info) and appropriate user feedback
656+
- Add unit tests for each check implementation
657+
- This task can be implemented in parallel with items 1-4 and is required for item 6.
658+
659+
6. **CLI Command Implementation**
660660

661661
- Implement the `rad upgrade kubernetes` command, integrating all previously defined components and interfaces.
662-
- Ensure the command performs pre-flight checks, user data backup creation, component upgrades, rollback on failure, and post-upgrade verification.
662+
- Ensure the command performs pre-flight checks, component upgrades, Helm-based rollback on failure, and post-upgrade verification.
663663
- Include detailed CLI output and logging for user visibility.
664664
- Add necessary unit and functional tests to validate command behavior.
665-
- This task depends on all previous tasks (1-6) and should be implemented last.
665+
- This task depends on all previous tasks (1-5) and should be implemented last.
666+
667+
### Future Versions
666668

667-
### Version 2: Data Store Migrations and Rollbacks
669+
#### Data Store Migrations and Rollbacks
668670

669671
1. Pick & embed a migration tool
670672

@@ -698,7 +700,26 @@ The following outlines the key implementation steps required to deliver the Radi
698700
- Versioning rules (major/minor jumps, compatibility guarantees)
699701
- Rollback advice: when to write reversible vs. irreversible migrations
700702

701-
### Version 3: Rollback to the most recent successful version of Radius
703+
#### Integrate User Data Backup and Restore
704+
705+
1. **User Data Backup and Restore Interfaces**
706+
707+
- Define two new interfaces in the `components/database` package:
708+
- `UserDataBackup`: Responsible for creating backups of user data before the upgrade.
709+
- `UserDataRestore`: Responsible for restoring data from the backup in case of rollback.
710+
- Design versioned backup formats to handle schema migrations between versions.
711+
- This task can be worked on in parallel with items 1-3 of the version 1 plan above. It's a blocker for item 2 (implementation).
712+
713+
2. **User Data Backup and Restore Implementation**
714+
715+
- Implement the backup and restore interfaces in the following data store implementations:
716+
- **In-memory datastore**: [inmemory/client.go](https://github.com/radius-project/radius/blob/main/pkg/components/database/inmemory/client.go)
717+
- **Postgres datastore**: [postgresclient.go](https://github.com/radius-project/radius/blob/main/pkg/components/database/postgres/postgresclient.go)
718+
- Add comprehensive unit tests for each implementation.
719+
- Implement backup storage mechanism in Kubernetes (ConfigMaps or PVs depending on size).
720+
- This task depends on item 1 (interfaces) and blocks integrating backup and restore into the CLI command.
721+
722+
#### Rollback to the most recent successful version of Radius
702723

703724
1. **Version History Tracking**
704725

@@ -719,7 +740,7 @@ The following outlines the key implementation steps required to deliver the Radi
719740
- Add scenarios: v0.43 → v0.44 upgrade → failure → `rad rollback` → verify control plane matches pre-upgrade state.
720741
- Test edge cases where no previous version is recorded.
721742

722-
### Version-4: Skip versions during `rad upgrade kubernetes`
743+
#### Skip versions during `rad upgrade kubernetes`
723744

724745
1. **Skip-Aware Pre-flight Checks**
725746

@@ -739,14 +760,14 @@ The following outlines the key implementation steps required to deliver the Radi
739760

740761
4. **Automated Integration Tests**
741762

742-
- Cover a variety of version skip paths in CI (adjacent vs. multiminor).
763+
- Cover a variety of version skip paths in CI (adjacent vs. multi-minor).
743764
- Fail if any migration or Helm chart upgrade in the skip path is missing.
744765

745-
### Version 5: Support for Air-Gapped Environments
766+
#### Support for Air-Gapped Environments
746767

747768
This can be discussed later.
748769

749-
### Version 6: Upgrading Radius on other platforms like `rad upgrade aci`
770+
#### Upgrading Radius on other platforms like `rad upgrade aci`
750771

751772
This can be discussed later.
752773

@@ -758,13 +779,13 @@ This can be discussed later.
758779

759780
### Implementation Risks and Mitigations
760781

761-
- **Backup Reliability**: User data backup and restore mechanisms must be thoroughly tested to ensure reliability. Consider edge cases such as backup corruption or restoration failures.
782+
- **Rollback Reliability**: Helm-based rollback mechanisms should be thoroughly tested to ensure they can return the control plane to a working state if upgrades fail.
762783
- **Lock Persistence**: Ensure upgrade locks have proper timeout mechanisms to avoid permanently locked systems if a process terminates unexpectedly.
763784

764785
### Testing Strategy
765786

766-
- **Unit Tests**: Cover all new code paths, especially backup and restore logic, upgrade logic, and error handling.
767-
- **Functional Tests**: Validate end-to-end upgrade scenarios, including successful upgrades, upgrades with custom configurations, failure scenarios, and rollback procedures.
787+
- **Unit Tests**: Cover all new code paths, especially version validation, upgrade logic, lock mechanisms, and error handling.
788+
- **Functional Tests**: Validate end-to-end upgrade scenarios, including successful upgrades, upgrades with custom configurations, failure scenarios, and Helm-based rollback procedures.
768789
- **Compatibility Tests**: Verify compatibility between different Radius CLI versions and control plane components.
769790

770791
## Open Questions
