[BUG] Minion events loop after upgrading syndic to 3005.1 #65166
Labels: Bug (broken, incorrect, or confusing behavior), Regression (the issue is a bug that breaks functionality known to work in previous releases), Salt-Syndic
I have a setup consisting of a single master of masters (MoM) with some directly connected minions, and multiple syndics, each with many connected minions.
To start with, everything was running Salt 3002.6. I have started an upgrade to 3005.1, because 3006.1 masters and syndics are incompatible with 3002.6 minions. The upgrade of the MoM went fine, and I have upgraded two syndics. The minions on the syndics started out pointing at the master on the same server, but after upgrading to 3005.1 I saw RAM and CPU usage on the syndics rise until the OOM killer kicked in and killed the master.
Snooping on the Salt events on the syndic master, I see that after startup the minion sends a message, which then gets repeated twice, then four times, then eight times, and so on, growing exponentially. This happens on both upgraded syndics.
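To make the doubling easier to confirm from a capture of the event bus, a minimal sketch along these lines could tally how often each event tag appears. It assumes the events have already been collected as `(tag, data)` pairs; the helper name `count_repeats` and the sample tag are hypothetical and not part of Salt.

```python
from collections import Counter

def count_repeats(events):
    """Return a Counter of how many times each event tag was seen.

    `events` is assumed to be an iterable of (tag, data) pairs captured
    from the syndic master's event bus (hypothetical capture format).
    """
    return Counter(tag for tag, _data in events)

# Hypothetical capture: the same tag seen 1 + 2 + 4 times.
sample = [("example/tag", {})] * 7 + [("other/tag", {})]
counts = count_repeats(sample)
```

A tag whose count keeps doubling between successive captures would match the behaviour described above.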
As a workaround I set the minions on the syndics to point to the MoM. With this, everything was working well, and I started upgrading the minions pointing to the upgraded syndics. But when I came to the first minion with a Salt-managed x509 certificate, that state failed. By snooping the events again, running the master on the syndic in debug mode, and reading the code, I discovered that the state causes the minion to publish a request to the master process on the syndic, destined for the minion on the syndic, which is the CA. However, the minion on the syndic is no longer connected to the master on the syndic, because I had to point it to the MoM. The publish call therefore returns an empty list of minions, and the x509.certificate_managed state hits a type error in create_certificate in the x509 module: certs is None, and "if not any(certs):" does not check that certs is iterable.
For now I have pointed that minion at the MoM as a workaround, but this is not sustainable. I need a fix for the exponentially growing events so I can point the minion on the syndic back at the master on the syndic.
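The type error itself can be shown in isolation. The sketch below is not the x509 module's code; it only demonstrates that `any(None)` raises `TypeError`, and shows one possible guard. The helper name `has_any_cert` is hypothetical.

```python
def has_any_cert(certs):
    """Guarded check: treat None (no minions answered) as 'no certificates'."""
    return any(certs or [])

# The unguarded form fails when certs is None:
try:
    any(None)
except TypeError as exc:
    error = str(exc)  # message says the NoneType object is not iterable
```

With a guard like this, an empty publish result would lead to a "no certificate returned" path rather than a traceback, although the state would still fail because the CA minion cannot be reached.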
I came across this workaround while reading through bug #62577. From the comment of 10 September 2022 onwards, it sounds like exactly the issue I am seeing, but they don't seem to be using the x509 module, so they can live with the workaround. The workaround also causes me issues because my syndics and minions are distributed globally, so running a highstate against the MoM can take 20-30 minutes due to the latency between continents.
The MoM and the syndics are all VMs, a mixture of RHEV and VMware. The minions are a mixture of hardware, RHEV and VMware. For 3005 I am using the classic packaging. The MoM, the syndics and a majority of the minions are RHEL 7.9. The rest of the minions are RHEL 8.6 or 8.8.
To reproduce, set up a MoM and a syndic running 3005, with the minion on the syndic pointing at the master on the syndic. To make the OOM show up more quickly, add some minions pointing at the syndic.
I expect 3005.1 to behave as 3002.6 (and previous versions) did when a minion on a syndic points at the master on the same server. I expect it not to cause an exponentially growing event storm.
salt --versions-report
Salt Version:
Salt: 3005.1
Dependency Versions:
cffi: 1.9.1
cherrypy: unknown
dateutil: Not Installed
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.11.1
libgit2: Not Installed
M2Crypto: 0.35.2
Mako: Not Installed
msgpack: 0.6.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.14
pycrypto: Not Installed
pycryptodome: Not Installed
pygit2: Not Installed
Python: 3.6.8 (default, Aug 13 2020, 07:46:32)
python-gnupg: Not Installed
PyYAML: 3.13
PyZMQ: 18.0.1
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.1.4
System Versions:
dist: rhel 7.9 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-1160.95.1.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.9 Maipo