feat(kb): add abort capability and enum standardization for zero-downtime update (#274)
Because
- Knowledge Base updates lack graceful cancellation - once started,
updates must complete or fail, with no way to manually abort them when
issues arise
- Status values were represented as untyped strings across the system,
causing inconsistencies between protobuf definitions, backend code,
database storage, and test assertions
- Files that haven't started processing (`NOTSTARTED`, `WAITING`) were
incorrectly blocking KB synchronization, causing integration tests to
hang until they timed out
- The system lacked comprehensive integration test coverage for update
lifecycle edge cases and concurrent operations during updates
This commit
**Adds Abort API for Knowledge Base Updates**
- **Introduces `AbortKnowledgeBaseUpdateAdmin` API** enabling
administrators to gracefully cancel ongoing KB updates
- **Cancels active Temporal workflows** and cleans up staging KB
resources (both completed and incomplete files)
- **Supports selective abort** - can abort specific catalogs by ID or
all currently updating catalogs
- **Sets KB status to `ABORTED`** allowing immediate re-upload after
cancellation
**Standardizes Enum Values Across the Stack**
- **Replaces string-based status values** (`"updating"`, `"completed"`)
with strongly-typed protobuf enum constants
(`KNOWLEDGE_BASE_UPDATE_STATUS_UPDATING`,
`KNOWLEDGE_BASE_UPDATE_STATUS_COMPLETED`)
- **Adds database migration (000034)** to update all existing records to
use standardized enum values
- **Ensures consistency** between protobuf definitions, Go backend code
(via `.String()`), database storage, and integration test assertions
- **Introduces 10 well-documented lifecycle states**: UNSPECIFIED, NONE,
UPDATING, SYNCING, VALIDATING, SWAPPING, COMPLETED, FAILED, ROLLED_BACK,
ABORTED
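
As a rough illustration of the enum standardization, the sketch below hand-writes a Go type for the ten states and a normalizer that accepts both the legacy lowercase strings and the new enum names. It is not the generated protobuf code: the numeric values, the `UpdateStatus` type, and the `NormalizeStatus` helper are assumptions made for this example.

```go
package main

import "strings"

// UpdateStatus mirrors the ten lifecycle states listed above.
// The numeric values are assumed; the real ones come from the protobuf definition.
type UpdateStatus int32

const (
	StatusUnspecified UpdateStatus = iota
	StatusNone
	StatusUpdating
	StatusSyncing
	StatusValidating
	StatusSwapping
	StatusCompleted
	StatusFailed
	StatusRolledBack
	StatusAborted
)

const statusPrefix = "KNOWLEDGE_BASE_UPDATE_STATUS_"

var statusNames = [...]string{
	"UNSPECIFIED", "NONE", "UPDATING", "SYNCING", "VALIDATING",
	"SWAPPING", "COMPLETED", "FAILED", "ROLLED_BACK", "ABORTED",
}

// String returns the full enum name, the form stored in the database.
func (s UpdateStatus) String() string {
	if s < 0 || int(s) >= len(statusNames) {
		return statusPrefix + statusNames[StatusUnspecified]
	}
	return statusPrefix + statusNames[s]
}

// NormalizeStatus (hypothetical helper) maps a legacy value such as
// "updating" or an already-standardized name to the full enum name.
// The second result is false when the value is not a known state.
func NormalizeStatus(raw string) (string, bool) {
	name := strings.ToUpper(strings.TrimSpace(raw))
	name = strings.TrimPrefix(name, statusPrefix)
	for _, n := range statusNames {
		if n == name {
			return statusPrefix + n, true
		}
	}
	return "", false
}
```

A migration or a backward-compatible test assertion could apply the same mapping: `NormalizeStatus("completed")` and `NormalizeStatus("KNOWLEDGE_BASE_UPDATE_STATUS_COMPLETED")` yield the same string.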
**Fixes Synchronization Logic**
- **Excludes unstarted files** from synchronization wait - only actively
processing files (`PROCESSING`, `CONVERTING`, `CHUNKING`, `EMBEDDING`)
now block the swap
- **Prevents indefinite hangs** when files are uploaded but not yet
processed during the update lifecycle
- **Enables proper KB locking** - after lock, no new processing starts,
so unstarted files can be safely ignored
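
The synchronization rule above can be sketched as a simple predicate over file statuses. The status strings come from this commit message; the function names `BlocksSwap` and `ReadyToSwap` are hypothetical and the real check presumably runs as a database query rather than over an in-memory slice.

```go
package main

// activeStatuses holds the only file states that block the swap.
// NOTSTARTED and WAITING are deliberately absent: once the KB is locked,
// no new processing starts, so those files can be ignored.
var activeStatuses = map[string]bool{
	"PROCESSING": true,
	"CONVERTING": true,
	"CHUNKING":   true,
	"EMBEDDING":  true,
}

// BlocksSwap reports whether a file in the given state must finish
// before the staging KB can be swapped in.
func BlocksSwap(status string) bool {
	return activeStatuses[status]
}

// ReadyToSwap reports whether no file is actively being processed.
func ReadyToSwap(fileStatuses []string) bool {
	for _, s := range fileStatuses {
		if BlocksSwap(s) {
			return false
		}
	}
	return true
}
```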
**Adds Comprehensive Test Coverage**
- **Group 5: 10 corner case tests** covering file operations during
different update phases (adding/deleting during swap, race conditions,
rapid operations, late-phase uploads)
- **Updated all test assertions** to handle both legacy string format
and new enum format for backward compatibility
- **Fixed `pollUpdateCompletion` helper** to recognize new enum-based
status values
## Architecture Highlights
### Zero-Downtime Update Strategy
1. **Phase 1-2 (UPDATING)**: Create staging KB, reprocess all files with
new config
2. **Phase 3 (SYNCING)**: Lock KB, wait for active processing to
complete (excluding unstarted files)
3. **Phase 4 (VALIDATING)**: Verify data integrity (file counts,
embeddings, chunks)
4. **Phase 5 (SWAPPING)**: Atomic pointer swap - production → rollback,
staging → production
5. **Phase 6 (COMPLETED)**: Clean up and retain the rollback KB for the
configured period
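
The swap in Phase 5 can be pictured with the in-memory sketch below. It only illustrates the pointer rotation (production → rollback, staging → production); the `SwapState` type and its fields are hypothetical, and the actual implementation most likely performs the swap inside a database transaction instead of behind a mutex.

```go
package main

import (
	"errors"
	"sync"
)

// SwapState is a hypothetical holder for the three KB identifiers
// involved in a zero-downtime update.
type SwapState struct {
	mu         sync.Mutex
	Production string
	Staging    string
	Rollback   string
}

// ErrNoStaging is returned when there is nothing to promote.
var ErrNoStaging = errors.New("no staging KB to swap in")

// Swap promotes staging to production and keeps the old production
// KB as the rollback target, all under a single lock.
func (s *SwapState) Swap() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.Staging == "" {
		return ErrNoStaging
	}
	s.Rollback, s.Production, s.Staging = s.Production, s.Staging, ""
	return nil
}
```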
### Dual-Processing During Updates
- Files uploaded during update are **processed to both production and
staging KBs**
- After swap, these files exist in correct collections without
reprocessing
- Files deleted during update are **removed from both KBs** (dual
deletion)
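
One way to read the dual-processing rule is as a target-selection step that runs before each upload or delete. The sketch below assumes a hypothetical `KBRef` type in which an empty `Staging` field means no update is in progress.

```go
package main

// KBRef is a hypothetical reference to a catalog's KBs.
// Staging is empty when no update is running.
type KBRef struct {
	Production string
	Staging    string
}

// Targets returns the KBs that a file upload or deletion should be
// applied to: production only, or production and staging during an update.
func (r KBRef) Targets() []string {
	if r.Staging == "" {
		return []string{r.Production}
	}
	return []string{r.Production, r.Staging}
}
```

Because the same list drives both processing and deletion, the two KBs stay consistent and nothing needs reprocessing after the swap.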
### Graceful Abort
- Cancels ongoing Temporal workflows via
`workflowClient.CancelWorkflow()`
- Cleans up staging KB resources (files, Milvus collection, metadata)
- Production KB remains unchanged and immediately available for new
updates
- No orphaned resources or stuck workflows