Skip to content

Conversation

@ryancurrah
Copy link
Contributor

@ryancurrah ryancurrah commented Mar 21, 2025

This resolves a fatal "concurrent map iteration and map write" error by synchronizing reads and writes to syncLockMap with a read-write mutex. This change has been running in our environment for 7 days (8,764 workflows) without reoccurrence.

Fixes #14300

Motivation

To stop fatal panics when reading and writing to the syncLockMap map at the same time.

Modifications

  • Added a read lock in the CheckWorkflowExistence function.
  • Replaced the exclusive lock with a read-write mutex.
  • For functions that only read syncLockMap, a read lock is used to allow concurrent reads safely.

Verification

  • Built the Argo Workflows controller image with make workflow-controller-image.
  • Pushed it to our Artifactory container registry.
  • Deployed in a non-production cluster (running production-like workloads in “shadow mode”).
  • Monitored logs for 7 days (8,764 workflows) with alerts on the known error message.
  • Observed no recurrence of the concurrency issue.

Documentation

N/A

This resolves a fatal "concurrent map iteration and map write" error
by synchronizing reads and writes to syncLockMap with a read-write
mutex. This change has been running in our environment for 7 days (8,764 workflows)
without reoccurrence.

Issue: argoprojGH-14300
Signed-off-by: Ryan Currah <ryan@currah.ca>
@ryancurrah ryancurrah changed the title fix(workflow/sync): use RWMutex to prevent concurrent map access fix(workflow/sync): use RWMutex to prevent concurrent map access Fixes #14300 Mar 21, 2025
@ryancurrah ryancurrah changed the title fix(workflow/sync): use RWMutex to prevent concurrent map access Fixes #14300 fix(workflow/sync): use RWMutex to prevent concurrent map access Mar 21, 2025
@Joibel Joibel requested review from Joibel and Copilot March 21, 2025 15:38
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a fatal "concurrent map iteration and map write" error by switching from a basic mutex to a read-write mutex in order to better synchronize access when workflows are checked, released, or released all at once.

  • The Manager struct now uses a sync.RWMutex.
  • CheckWorkflowExistence now acquires a read lock.
  • The Release and ReleaseAll methods have been updated to use read locks.
Comments suppressed due to low confidence (1)

workflow/sync/sync_manager.go:342

  • If the ReleaseAll method modifies the syncLockMap or its contents, consider using a write lock to ensure safe concurrent modifications.
sm.lock.RLock()

@Joibel Joibel enabled auto-merge (squash) March 21, 2025 15:45
@Joibel
Copy link
Member

Joibel commented Mar 21, 2025

/cherry-pick release-3.6

@Joibel Joibel merged commit 73a8e70 into argoproj:main Mar 21, 2025
37 checks passed
gcp-cherry-pick-bot bot pushed a commit that referenced this pull request Mar 21, 2025
kubabuczak pushed a commit to splunk/argo-workflows that referenced this pull request Mar 24, 2025
Joibel pushed a commit that referenced this pull request Mar 24, 2025
…rry-pick #14321) (#14322)

Signed-off-by: Ryan Currah <ryan@currah.ca>
Co-authored-by: Ryan Currah <ryan@currah.ca>
kim-codefresh added a commit to codefresh-io/argo-workflows that referenced this pull request May 20, 2025
…abilities fixes (Cr 28355) (#358)

* fix: bump deps for k8schain to fix ecr-login (argoproj#14008) (release-3.6 cherry-pick) (argoproj#14174)

* fix(ci): python sdk release process (release-3.6) (argoproj#14183)

Signed-off-by: Alan Clucas <alan@clucas.org>

* docs: clarify qps/burst on controller (cherry-pick argoproj#14190) (argoproj#14192)

Signed-off-by: Tim Collins <tim@thecollins.team>
Co-authored-by: Tim Collins <45351296+tico24@users.noreply.github.com>

* fix(api/jsonschema): use unchanging JSON Schema version (cherry-pick argoproj#14092) (argoproj#14256)

Signed-off-by: Roger Peppe <rogpeppe@gmail.com>
Co-authored-by: Roger Peppe <rogpeppe@gmail.com>

* fix(api/jsonschema): use working `$id` (cherry-pick argoproj#14257) (argoproj#14258)

Signed-off-by: Roger Peppe <rogpeppe@gmail.com>
Co-authored-by: Roger Peppe <rogpeppe@gmail.com>

* docs: autogenerate tested k8s versions and centralize config (argoproj#14176) (release-3.6) (argoproj#14262)

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* chore(deps): bump minio-go to newer version (argoproj#14185) (release-3.6) (argoproj#14261)

Co-authored-by: Vaibhav Kaushik <vaibhavkaushik@salesforce.com>

* fix: split pod controller from workflow controller (argoproj#14129) (release-3.6) (argoproj#14263)

* chore(deps): fix snyk (argoproj#14264) (release-3.6) (argoproj#14268)

* chore: revert to correct k8s versions

Accidental bump from argoproj#14176 cherry-pick

Signed-off-by: Alan Clucas <alan@clucas.org>

* chore(deps): bump github.com/go-jose/go-jose/v3 from 3.0.3 to 3.0.4 in the go_modules group (cherry-pick argoproj#14231) (argoproj#14269)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: wait for workflow informer to sync before pod informer (cherry-pick argoproj#14248) (argoproj#14266)

Signed-off-by: Rohan K <rohankmr414@gmail.com>
Co-authored-by: Rohan K <rohankmr414@gmail.com>

* fix(cli): remove red from log colour selection. Fixes argoproj#6740 (cherry-pick argoproj#14215) (argoproj#14278)

Signed-off-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>
Co-authored-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>

* fix: correct semaphore configmap keys for multiple semaphores (argoproj#14184) (release-3.6) (argoproj#14281)

* fix: don't print help for non-validation errors. Fixes argoproj#14234 (cherry-pick argoproj#14249) (argoproj#14283)

Signed-off-by: Koichi Shimada <jumpe1programming@gmail.com>
Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Co-authored-by: koichi <51446844+jumpe1@users.noreply.github.com>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* docs: fix kubernetes versions (release-3.6) (argoproj#14273)

Signed-off-by: Alan Clucas <alan@clucas.org>

* fix(workflow/sync): use RWMutex to prevent concurrent map access (cherry-pick argoproj#14321) (argoproj#14322)

Signed-off-by: Ryan Currah <ryan@currah.ca>
Co-authored-by: Ryan Currah <ryan@currah.ca>

* chore(lint): update golangci-lint to 2.1.1 (argoproj#14390) (cherry-pick release-3.6) (argoproj#14417)

* chore: bump golang 1.23->1.24 (argoproj#14385) (cherry-pick release-3.6) (argoproj#14418)

* fix: gracefully handle invalid CronWorkflows and simplify logic.  (cherry-pick argoproj#14197) (argoproj#14419)

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* fix: prevent dfs sorter infinite recursion on cycle. Fixes argoproj#13395 (cherry-pick argoproj#14391) (argoproj#14420)

Signed-off-by: Adrien Delannoy <a.delannoyfr@gmail.com>
Co-authored-by: Adrien Delannoy <a.delannoyfr@gmail.com>

* chore(deps): bump github.com/expr-lang/expr from 1.16.9 to 1.17.0 (argoproj#14307) (release-3.6) (argoproj#14421)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps)!: update k8s and argo-events (release-3.6) (argoproj#14424)

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: william.vanhevelingen <william.vanhevelingen@acquia.com>
Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: William Van Hevelingen <William.VanHevelingen@acquia.com>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* fix: correct retry logic (argoproj#13734) (release-3.6) (argoproj#14428)

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>

* fix: manual retries exit handler cleanup. Fixes argoproj#14180 (argoproj#14181) (release-3.6) (argoproj#14429)

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>

* fix: correct manual retry logic. Fixes argoproj#14124 (argoproj#14328) (release-3.6) (argoproj#14430)

Signed-off-by: oninowang <oninowang@tencent.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: jswxstw <jswxstw@gmail.com>

* fix: disable ALPN in argo-server as a workaround (argoproj#14433)

Signed-off-by: Alan Clucas <alan@clucas.org>

* result of codegen

Signed-off-by: Kim <kim.aharfi@codefresh.io>

* fix:lint

Signed-off-by: Kim <kim.aharfi@codefresh.io>

---------

Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Tim Collins <tim@thecollins.team>
Signed-off-by: Roger Peppe <rogpeppe@gmail.com>
Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Rohan K <rohankmr414@gmail.com>
Signed-off-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>
Signed-off-by: Koichi Shimada <jumpe1programming@gmail.com>
Signed-off-by: Ryan Currah <ryan@currah.ca>
Signed-off-by: Adrien Delannoy <a.delannoyfr@gmail.com>
Signed-off-by: william.vanhevelingen <william.vanhevelingen@acquia.com>
Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: oninowang <oninowang@tencent.com>
Signed-off-by: Kim <kim.aharfi@codefresh.io>
Co-authored-by: Alan Clucas <alan@clucas.org>
Co-authored-by: gcp-cherry-pick-bot[bot] <98988430+gcp-cherry-pick-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Collins <45351296+tico24@users.noreply.github.com>
Co-authored-by: Roger Peppe <rogpeppe@gmail.com>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Co-authored-by: Vaibhav Kaushik <vaibhavkaushik@salesforce.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Rohan K <rohankmr414@gmail.com>
Co-authored-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>
Co-authored-by: koichi <51446844+jumpe1@users.noreply.github.com>
Co-authored-by: Ryan Currah <ryan@currah.ca>
Co-authored-by: Adrien Delannoy <a.delannoyfr@gmail.com>
Co-authored-by: William Van Hevelingen <William.VanHevelingen@acquia.com>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>
Co-authored-by: jswxstw <jswxstw@gmail.com>
@Joibel Joibel mentioned this pull request Dec 8, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Argo Controller Crash Due to Concurrent Map Access

2 participants