[BUG] Minion events loop after upgrading syndic to 3005.1 #65166

Open
elwe11 opened this issue Sep 11, 2023 · 2 comments
Labels: Bug (broken, incorrect, or confusing behavior), Regression (the issue is a bug that breaks functionality known to work in previous releases), Salt-Syndic

Comments


elwe11 commented Sep 11, 2023

I have a setup consisting of a single MoM, with some directly connected minions, and multiple syndics, each with many connected minions.

To start with, everything was running Salt 3002.6. I have started an upgrade to 3005.1, as 3006.1 masters/syndics are incompatible with 3002.6 minions. The upgrade of the MoM went fine. I have upgraded two syndics. The minions on the syndics started out pointing at the master on the same server, but after upgrading to 3005.1 I saw RAM and CPU usage on the syndics rise until the OOM killer kicked in and killed the master.

Snooping on the Salt events on the syndic master, after startup I see the minion send a message, which then gets repeated twice, then four times, then eight times and so on, growing exponentially. This happens on both upgraded syndics.
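To give a rough feel for how fast this runs away, here is a toy model of the doubling described above. It only assumes that the number of copies doubles each round (1, 2, 4, 8, ...), as observed; the function name and the limit are hypothetical, and the time between rounds is not something I have measured.

```python
def rounds_until(limit: int, start: int = 1, factor: int = 2) -> int:
    """Count the rounds needed for the number of copies to reach `limit`.

    Toy model only: assumes every round multiplies the copies of an event
    by `factor`, matching the 1, 2, 4, 8 pattern seen on the event bus.
    """
    rounds, copies = 0, start
    while copies < limit:
        copies *= factor
        rounds += 1
    return rounds


if __name__ == "__main__":
    # How many doublings before a single event has become a million copies?
    print(rounds_until(1_000_000))
```

With pure doubling it takes only a couple of dozen rounds for one event to turn into millions, which fits with the master running out of memory fairly soon after startup.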

As a workaround I set the minions on the syndics to point to the MoM. With this everything worked well, and I started upgrading the minions pointing to the upgraded syndics. But when I came to the first minion with a Salt-managed x509 certificate, that state failed.

Snooping the events again, running the master on the syndic in debug mode and reading the code, I discovered that the state causes the minion to publish a request to the master process on the syndic, destined for the minion on the syndic, which is the CA. However, the minion on the syndic is no longer connected to the master on the syndic, because I had to point it to the MoM. The publish call therefore returns an empty list of minions, and the x509.certificate_managed state hits a TypeError in create_certificate in the x509 module: certs is None, and "if not any(certs):" does not check that certs is iterable. For now I have pointed that minion at the MoM as a workaround, but this is not sustainable. I need a fix for the exponentially growing events so I can point the minion on the syndic back at the master on the syndic.
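The TypeError itself is easy to reproduce outside Salt. The snippet below is a standalone sketch of the pattern, not the actual x509 module code: `has_signing_response` is a hypothetical helper name, and the dict reply shape is an assumption for illustration. It shows why `any(certs)` fails on None and what a defensive check could look like.

```python
from typing import Iterable, Optional


def has_signing_response(certs: Optional[Iterable]) -> bool:
    """Return True only if the publish call produced at least one truthy reply.

    None (no minion answered) is treated like an empty reply instead of
    letting any() raise TypeError on a non-iterable.
    """
    if certs is None:
        return False
    return any(certs)


if __name__ == "__main__":
    # The unguarded pattern fails when certs is None.
    try:
        if not any(None):
            pass
    except TypeError as exc:
        print(f"unguarded check raised TypeError: {exc}")

    # The guarded helper reports "no response" instead of crashing.
    print(has_signing_response(None))
    print(has_signing_response({}))
    # Hypothetical reply keyed by minion id.
    print(has_signing_response({"ca-minion": "pem data"}))
```

A guard like this would only turn the crash into a clean "no response from the CA" failure. It would not fix the underlying problem that the CA minion is unreachable through the syndic master.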

I came across this workaround while reading through bug #62577. From the comment of 10 September 2022 onward, it sounds like exactly the issue I am seeing, but they don't seem to be using the x509 module, so they can live with the workaround. The workaround also causes me issues because my syndics and minions are distributed globally, so running a highstate against the MoM can take 20-30 minutes due to the latency between continents.

The MoM and the syndics are all VMs, a mixture of RHEV and VMware. The minions are a mixture of hardware, RHEV and VMware. For 3005 I am using the classic packaging. The MoM, the syndics and a majority of the minions are RHEL 7.9. The rest of the minions are RHEL 8.6 or 8.8.

The steps to reproduce are to set up a MoM and a syndic running 3005, with the minion on the syndic pointing at the master on the syndic. To make the OOM show up quicker, add some minions pointing at the syndic.

I expect 3005.1 to behave as 3002.6 (and previous versions) when a minion on a syndic points at the master on the same server. I expect it not to cause an exponentially growing event storm.

salt --versions-report

Salt Version:
Salt: 3005.1

Dependency Versions:
cffi: 1.9.1
cherrypy: unknown
dateutil: Not Installed
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.11.1
libgit2: Not Installed
M2Crypto: 0.35.2
Mako: Not Installed
msgpack: 0.6.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.14
pycrypto: Not Installed
pycryptodome: Not Installed
pygit2: Not Installed
Python: 3.6.8 (default, Aug 13 2020, 07:46:32)
python-gnupg: Not Installed
PyYAML: 3.13
PyZMQ: 18.0.1
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.1.4

System Versions:
dist: rhel 7.9 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-1160.95.1.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.9 Maipo

elwe11 added the Bug and needs-triage labels Sep 11, 2023

welcome bot commented Sep 11, 2023

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at saltproject@vmware.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!


elwe11 commented Sep 11, 2023

When I run the master on the syndic in debug mode I see messages like:

[DEBUG ] Sending event: tag = syndic/<syndic fqdn>/syndic/<syndic fqdn>/syndic/<syndic fqdn>/syndic/<syndic fqdn>/syndic/<syndic fqdn>/syndic/<syndic fqdn>/salt/auth; data = {'result': True, 'act': 'accept', 'id': '<syndic fqdn>', 'pub': '-----BEGIN PUBLIC KEY-----\n<key data>\n-----END PUBLIC KEY-----\n', '_stamp': '2023-09-11T18:47:34.460863'}

And I see the same on the event bus. It is noteworthy that the first message has the syndic FQDN once in the tag, and then I start to see the same message with syndic/<fqdn> prepended more and more times.
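To spot looping events quickly when snooping the bus, a small helper that counts the stacked prefixes can help. This is a minimal sketch that works on tag strings only: `syndic_prefix_depth` is a hypothetical name, `syndic.example.com` stands in for the real FQDN, and it assumes the prefix always has the form `syndic/<id>/` as in the debug output above.

```python
def syndic_prefix_depth(tag: str) -> int:
    """Count how many leading 'syndic/<id>/' pairs an event tag carries."""
    parts = tag.split("/")
    depth = 0
    # Walk the tag two segments at a time: the literal 'syndic', then the id.
    while len(parts) >= 2 * depth + 2 and parts[2 * depth] == "syndic":
        depth += 1
    return depth


if __name__ == "__main__":
    fqdn = "syndic.example.com"  # placeholder, not a real host
    for copies in (0, 1, 6):
        tag = f"syndic/{fqdn}/" * copies + "salt/auth"
        print(syndic_prefix_depth(tag), tag)
```

A depth of 1 matches the first message I see. Anything deeper means the same event has been picked up and prefixed again.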

anilsil added this to the Argon v3008 milestone Sep 11, 2023
OrangeDog added the Regression and Salt-Syndic labels Sep 12, 2023