Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Batch execution of multiple modules fails with salt global exception #60361

Closed
bzukdatto opened this issue Jun 11, 2021 · 7 comments · Fixed by #61519
Closed

[BUG] Batch execution of multiple modules fails with salt global exception #60361

bzukdatto opened this issue Jun 11, 2021 · 7 comments · Fixed by #61519
Assignees
Labels
Bug broken, incorrect, or confusing behavior Phosphorus v3005.0 Release code name and version Regression The issue is a bug that breaks functionality known to work in previous releases. severity-high 2nd top severity, seen by most users, causes major problems

Comments

@bzukdatto
Copy link

Description
Batched running of multiple modules fails due to TypeError exception. Both batch.py and salt.py assume a single return code for batched execution runs, but running multiple modules results in a dictionary of return codes.

Applied the following patches to job.py and salt.py and they appear to fix the issue

root@use1-salt-1:/usr/lib/python3/dist-packages/salt/cli# diff -u `pwd`/salt.py  /root/salt.py_patched
diff --git a/cli/salt.py b/cli/salt.py
index c672130..f3c8789 100644
--- a/cli/salt.py
+++ b/cli/salt.py
@@ -299,6 +299,13 @@ class SaltCMD(salt.utils.parsers.SaltCMDOptionParser):
             for res in batch.run():
                 for ret in res.values():
                     job_retcode = salt.utils.job.get_retcode(ret)
+                    # handle batched running of multiple modules
+                    if isinstance(job_retcode, dict):
+                        try:
+                            job_retcode = max(job_retcode.values())
+                        # handle no retcode values in dict
+                        except ValueError:
+                            job_retcode = 0
                     if job_retcode > retcode:
                         # Exit with the highest retcode we find
                         retcode = job_retcode


--- /root/job.py        2021-01-14 09:16:03.332204018 -0500
+++ /usr/lib/python3/dist-packages/salt/utils/job.py    2021-01-14 09:10:58.415043396 -0500
@@ -174,8 +174,14 @@
     """
     retcode = 0
     # if there is a dict with retcode, use that
-    if isinstance(ret, dict) and ret.get("retcode", 0) != 0:
-        return ret["retcode"]
+    if isinstance(ret, dict):
+        retcode = ret.get("retcode", 0)
+        if isinstance(retcode, dict):
+            for retcode in retcode.values():
+                if retcode != 0:
+                    return retcode
+        elif retcode != 0:
+            return retcode
     # if its a boolean, False means 1
     elif isinstance(ret, bool) and not ret:
         return 1

Setup
Salt master and multiple syndics all running 3003.
Minions running 3003

Steps to Reproduce the behavior
Run a batch command with multiple modules:
salt --no-color --hide-timeout --timeout 30 --out json --out-file /tmp/cldeng-20519.txt '*' -b 50 virt.full_info,cmd.run ,'/bin/true'

Get the following stack trace:

[ERROR   ] An un-handled exception was caught by salt's global exception handler:
TypeError: '>' not supported between instances of 'dict' and 'int'
Traceback (most recent call last):
  File "/usr/bin/salt", line 11, in <module>
    load_entry_point('salt==3003', 'console_scripts', 'salt')()
  File "/usr/lib/python3/dist-packages/salt/scripts.py", line 539, in salt_main
    client.run()
  File "/usr/lib/python3/dist-packages/salt/cli/salt.py", line 61, in run
    self._run_batch()
  File "/usr/lib/python3/dist-packages/salt/cli/salt.py", line 302, in _run_batch
    if job_retcode > retcode:
TypeError: '>' not supported between instances of 'dict' and 'int'
Traceback (most recent call last):
  File "/usr/bin/salt", line 11, in <module>
    load_entry_point('salt==3003', 'console_scripts', 'salt')()
  File "/usr/lib/python3/dist-packages/salt/scripts.py", line 539, in salt_main
    client.run()
  File "/usr/lib/python3/dist-packages/salt/cli/salt.py", line 61, in run
    self._run_batch()
  File "/usr/lib/python3/dist-packages/salt/cli/salt.py", line 302, in _run_batch
    if job_retcode > retcode:
TypeError: '>' not supported between instances of 'dict' and 'int'

(Include debug logs if possible and relevant)

Expected behavior
Commands return successfully and report the correct return code (highest value returned)

Screenshots
If applicable, add screenshots to help explain your problem.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
# salt --versions-report
Salt Version:
          Salt: 3003

Dependency Versions:
          cffi: Not Installed
      cherrypy: unknown
      dateutil: 2.6.1
     docker-py: Not Installed
         gitdb: 2.0.3
     gitpython: 2.1.8
        Jinja2: 2.10
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 0.5.6
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: 3.4.7
        pygit2: Not Installed
        Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
  python-gnupg: 0.4.1
        PyYAML: 3.12
         PyZMQ: 17.1.2
         smmap: 2.0.3
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.2.5

System Versions:
          dist: ubuntu 18.04 Bionic Beaver
        locale: UTF-8
       machine: x86_64
       release: 4.15.0-99-generic
        system: Linux
       version: Ubuntu 18.04 Bionic Beaver

Additional context
Add any other context about the problem here.

@bzukdatto bzukdatto added Bug broken, incorrect, or confusing behavior needs-triage labels Jun 11, 2021
@welcome
Copy link

welcome bot commented Jun 11, 2021

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
Please be sure to review our Code of Conduct. Also, check out some of our community resources including:

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
If you have additional questions, email us at saltproject@vmware.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!

@waynew
Copy link
Contributor

waynew commented Jun 14, 2021

Verified this using this command

salt \* -b 1 test.ping,test.version ,'/bin/true'

I get the same result.

Looking at the code, job.py has basically been unchanged, however there have been a few changes to batch.py. I'm not positive if that's where things went a bit sideways, but it looks like the return isn't returning the expected sort of value 🤔

This fix would stop the exception, but AFAICT this should be working (though, I couldn't get back to a point where it was working, so it's possible that it never did? 🤷 )

@waynew waynew added this to the Approved milestone Jun 14, 2021
@waynew
Copy link
Contributor

waynew commented Jun 15, 2021

Digging into this a bit more, it's definitely due to batch.py getting horked up. In fact, replacing parts of the existing batch.py -- specifically if you put this block in:

salt/salt/cli/batch.py

Lines 222 to 247 in ead47e4

if minion in active:
active.remove(minion)
if bwait:
wait.append(datetime.now() + timedelta(seconds=bwait))
if self.opts.get('raw'):
ret[minion] = data
yield data
elif self.opts.get('failhard'):
# When failhard is passed, we need to return all data to include
# the retcode to use in salt/cli/salt.py later. See issue #24996.
ret[minion] = data
yield {minion: data}
else:
ret[minion] = data['ret']
yield {minion: data['ret']}
if not self.quiet:
ret[minion] = data['ret']
data[minion] = data.pop('ret')
if 'out' in data:
out = data.pop('out')
else:
out = None
salt.output.display_output(
data,
out,
self.opts)

No other changes are necessary - that will make this existing code work. Granted, I don't think that's a comprehensive fix, but it does solve this particular problem.

Some time between then and now, that block got changed in an unexpected way, causing the failure that we're seeing. From what I can see, the correct fix would be to add some tests to verify that run is, in fact, returning expected data not just for the cases where there's a single return, but for multiple returns and every case, and ensure that all of the expected data is being returned. Naturally there should be some failures based on the current state, but... that's where this fix should happen 👍

@waynew waynew added the severity-high 2nd top severity, seen by most users, causes major problems label Jun 15, 2021
@waynew waynew removed their assignment Jun 15, 2021
@sagetherage sagetherage modified the milestones: Approved, Phosphorus Jun 18, 2021
@sagetherage sagetherage added Phosphorus v3005.0 Release code name and version Regression The issue is a bug that breaks functionality known to work in previous releases. labels Jun 18, 2021
@bzukdatto
Copy link
Author

I was going to test the above batch.py to see if it would solve this bug, but that commit appears to be very old (2016ish) -- the batch.py file supplied with Salt 3003 is very different (https://github.com/saltstack/salt/blob/9242a14ebc204044cc585f337591cf85d997501e/salt/cli/batch.py)

@bzukdatto
Copy link
Author

Is there any more information I can provide here? Not clear on next steps.

@MKLeb MKLeb self-assigned this Jan 19, 2022
@MKLeb
Copy link
Contributor

MKLeb commented Jan 20, 2022

@bzukdatto I have verified this is still an issue with

salt * -b 50 test.ping,test.echo ,nacl

and will start working on a fix, using what you and @waynew have already said as a launching pad.

@MKLeb
Copy link
Contributor

MKLeb commented Jan 25, 2022

Actually, I've confirmed that normal retcodes are also returning incorrectly in normal scenarios. It seems, as @waynew stated, batch.py got messed up somewhere in the past, and I'd like to keep the fix local to that file if able since it really is a problem with how it operates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior Phosphorus v3005.0 Release code name and version Regression The issue is a bug that breaks functionality known to work in previous releases. severity-high 2nd top severity, seen by most users, causes major problems
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants