Upgrading stuck when resharding to empty a node #20

Merged
albertompe merged 2 commits into develop from bugfix/stucked-resharding on Jan 29, 2026


@albertompe
Contributor

Closes #19

Always reconcile when in Upgrading status.
Slots in importing status are taken into account to detect open slots.
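
As an illustration of the second point, here is a minimal sketch of open-slot detection that counts both ends of a migration. It is not the code from this PR: `OpenSlot` and `parseOpenSlots` are hypothetical names, and the sketch only assumes the documented `CLUSTER NODES` line format, where a migrating slot appears as `[slot->-node-id]` and an importing slot as `[slot-<-node-id]` after the eight fixed fields.

```go
package main

import (
	"strconv"
	"strings"
)

// OpenSlot describes a slot that is mid-migration on one node.
// Hypothetical type for illustration; not taken from the repository.
type OpenSlot struct {
	Slot   int
	State  string // "migrating" or "importing"
	PeerID string // node ID on the other end of the migration
}

// parseOpenSlots scans the slot fields of a single CLUSTER NODES line and
// returns every slot in migrating OR importing state. Ignoring the
// importing side would miss slots left open only on the destination node.
func parseOpenSlots(line string) []OpenSlot {
	var open []OpenSlot
	fields := strings.Fields(line)
	// The first eight fields are id, address, flags, primary, ping-sent,
	// pong-recv, config-epoch and link-state; slot entries follow.
	if len(fields) <= 8 {
		return open
	}
	for _, f := range fields[8:] {
		// Plain assignments such as "0-5460" are not open slots.
		if !strings.HasPrefix(f, "[") || !strings.HasSuffix(f, "]") {
			continue
		}
		body := f[1 : len(f)-1]
		state, sep := "migrating", "->-"
		idx := strings.Index(body, sep)
		if idx < 0 {
			state, sep = "importing", "-<-"
			idx = strings.Index(body, sep)
		}
		if idx < 0 {
			continue
		}
		slot, err := strconv.Atoi(body[:idx])
		if err != nil {
			continue
		}
		open = append(open, OpenSlot{Slot: slot, State: state, PeerID: body[idx+len(sep):]})
	}
	return open
}
```

Calling `parseOpenSlots` on each line of a node's `CLUSTER NODES` output and treating any non-empty result as "open slots present" would give a caller enough information to decide whether stabilization is needed.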

albertompe merged commit 1870fda into develop on Jan 29, 2026
3 of 4 checks passed
albertompe deleted the bugfix/stucked-resharding branch on January 29, 2026 12:17
albertompe added a commit that referenced this pull request Feb 11, 2026
* Check slots in importing state when stabilizing

* Always reconcile when in Upgrading status
albertompe added a commit that referenced this pull request Feb 11, 2026
* Upload code (#1)

* feat: upload code

* fix: typo

* fix: compliance check

* fix: add go.sum to REUSE.toml

* feat: add makefile goal to port forward metrics

* fix: makefile

* Redis Operator logic and API

* base API implementation

* add cluster status and check method signature

* project restructuration

* cluster rebalance

* move endpoint

* use logr

* server MoveNodSlots tests

* redis cluster tests

* redis client tests

* util tests

* config tests

* check nodes integrity

* node restart recovery

* scale up and down

* scale up and down tests

* upgrade cluster and reset node endpoint

* compliance check

* tests

* get cluster nodes endpoint

* replicas per master

* move slots with replicas

* ensure all nodes are up before running redis-cli and parameters for config and disable metrics

* api and architecture

* compliance check

* update architecture figure

* http server refactor

* compliance check

* handle HTTP shutdown

* use slog

* operation refactor

* compliance check

* remove cluster operation

* rediscluster package

* documentation

* handles get nodes test idempotency

* add health endpoint

* balance cluster if node has less slots than expected (#4)

* Add Redis Standalone Mode support

* standalone mode

* rename rediscluster package to cluster

* rename redisclusters

* use latest go version

* add Configuring status required to build the cluster after initializing the individual node pods

* fix formatting in log messages when unknown fields are found from cluster info when gathering metrics

---------

Co-authored-by: Alberto Martínez Pérez <albertompe@ext.inditex.com>

* Renaming rediscluster endpoints as redkeycluster (#5)

* Update Go version to v1.24.6

* Rename endpoints to use redkeycluster

* Delete .go-version file

* Rename handler functions to use RedKeyCluster instead of RedisCluster/Cluster

* Rename RedisCluster as RedKeyCluster

* Rename internal/cluster/redis.go as internal/cluster/redkey.go

* Rename the value of redkeyClusterMetrics constant

* Trim spaces

* Renaming Redis Operator as RedKey Operator

* Cleanup

* Add RedKey Robin features

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Rename RedisRobin as RedkeyRobin

* Added new endpoint to recreate the cluster (#6)

* Fix panic when resharding and tests (#8)

* feat: fix move panic and operations tests

* fix: verify

* feat: redkey cluster client and operation factory

* feat: configuration variable for meet sleep time and redkey tests

* feat: configuration variable for node reset sleep time and redkey tests

* feat: redkey tests

* fix: compliance

* feat: tests

* feat: format code

* feat: RedisOperationFix tests

* feat: redis node tests

* Fix logging message when starting cluster upgrading

* feat: set 1.24.6 golang version

---------

Co-authored-by: Alberto Martínez Pérez <albertompe@ext.inditex.com>

* Check for open slots over reconciliations when upgrading and fixing upgrading logic (#10)

* Update stored nodes info when removing a node

* Add redis-cli to debug image, set the port 40002 in Makefile for port forwarding and add debug launch configuration

* Update .gitignore file

* Check if open slots persist over reconciliations and stabilize them

* Add copyright info

* Added GetNodeById function

* Configuration parameter added to establish the slot stabilization threshold

* Execute setslot stable on both ends when stabilizing an open slot

* Add error handling and log tracing to slots stabilization

* Update stabilize slots reconciliation threshold to 3 reconciliation loops by default

* Fix migrating/importing slots detection

* Node list is now updated taking into account the required replicas to discard/forget exceeding nodes

* Fix test error

* Fix test TestGetNodesInfo

* Add tests for function stabilizeOpenSlots() and fix failing tests

* Error handling when executing commands (#11)

* Manage errors when executing commands

* Add tests to check error handling

* fmt

* Fixes and stabilization (#12)

* Rename function

* Rename function

* Refresh nodes list only when needed

* Launching a reconciliation for Ready status when scaling up may cause non desired operations and interactions with scaling up reconciliation

* Build images using the defined Golang, Delve and Redis Client from Makefile to match the Redis version

* Remove nodes one by one to avoid problems when a rebalance or forget operation fails

* Expose /v1/cluster/status endpoint

* Update docker files

* Fix Dockerfile group and user creation

* Try to fix redkey cluster if rebalance fails when scaling down

* Execute redis-cli commands to first redkey cluster node

* Update getClient() function to fallback to default address when no nodes are defined

* Remove error logging

* Add tests to getClient()

* We reset the call to reconcileReadyStatus() when performing a scaling up

* Fix IMG default value

* Fix IMG default value

Signed-off-by

* Fix IMG default value

* Robin using primaries and decoupling Operator from Redis (#13)

* Rename master nodes to primary nodes

* Rename master nodes to primary nodes

* Node field masterId renames as primaryId

* Move from master nodes to primary nodes

* Remove slots from Robin response when asking for cluster nodes

* Remove flags from Robin cluster nodes endpoint and add role

* Update openapi-rest.yml file

* Fix Operator E2E tests (#14)

* feat: recreate operation, optimize Dockerfile, conflicting operations and timeout for cluster check

* feat: recreate operation and conflicting operation tests, use of getConflictOperation in all public operations, primaries are selected by name, fix httpserver tests and minor fixes

* feat: variabilize cluster check timeout

* feat: conflict matrix revision and operation conflict error

* fix: minor bugs in tests.

---------

Co-authored-by: Daniel Dorado <danieladf@inditex.com>

* Start the server and then initialize the cluster. Error handling fixed

* Start the server and then initialize the cluster. Error handling fixed

* Exit when error instantiating reconciler or poller. Standardize error handling in goroutines

* fix: remove E2E test Make targets that are not being used

* Update Go version to v1.25.6 (#17)

* Update Go version to v1.25.6

* Move go config to .tool-versions file

* Add copyright info

* fix: .tool-version file. (#18)

* Upgrading stuck when resharding to empty a node (#20)

* Check slots in importing state when stabilizing

* Always reconcile when in Upgrading status

* Prepare release (#21)

* Update README.md file adding project detailed info

* Update Go version to 1.25.7

* Renaming Makefile variables

* Update version

* Add release workflow

* Rename job in release workflow

* chore: update CODEOWNERS

* Update SECURITY.md (#7)

* Update CLA mention in CONTRIBUTING.md

Removed link formatting from CLA mention.

---------

Co-authored-by: Miguel Martínez del Horno <148335298+miguelmdh@users.noreply.github.com>
Co-authored-by: Alberto Martínez Pérez <alb.mtex@gmail.com>
Co-authored-by: Daniel Dorado <danieladf@inditex.com>
Co-authored-by: Jorge Teixeira <jorgetcr@ext.inditex.com>
Co-authored-by: Javier Pérez Arias <javier.pa.tech@gmail.com>
Co-authored-by: Mariano Alonso Ortiz <60620696+marianoao@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[Bug] Regression: RKCL slow upgrading gets stuck when resharding
