
Concurrency policy fails when set to 1 #4481

Closed

@djha-skin
SUMMARY

I have created a very simple orquesta workflow whose only purpose is to let me set a concurrency policy over a particular set of actions.

I set a concurrency policy on this workflow with a threshold of 1, hoping to mutex the workflow so that only one instance runs at a time.

Unfortunately, the policy is not enforced, due to what looks like a race condition.

ISSUE TYPE
  • Bug Report
STACKSTORM VERSION

st2 --version:

st2 2.10.0, on Python 2.7.5
OS / ENVIRONMENT / INSTALL METHOD

I have a two-node, active-active, highly-available stackstorm cluster. The cluster is set up with everything running on every node: the garbage collector, the scheduler, the runners, the web page, everything. The web page and API layers sit behind a load balancer.

I do NOT have mistral or st2chatops installed.

Both nodes run on RHEL 7. The mongo database runs on its own cluster of VMs, as does rabbitmq. In addition, I have set up zookeeper as a coordination server.

I have overcome most of the challenges with this setup. It works well for me, but in the interest of transparency I have presented it here.

STEPS TO REPRODUCE

Before beginning, I want to note that the following is a rough sketch of how to set up something that looks like my HA environment. It is not perfect, but it should hopefully be enough to reproduce the error.

  1. Create a workflow that looks like this:
PS C:\Users\me\Development\src\st2-pack> cat .\actions\workflows\my-task.yaml
version: 1.0

description: Workflow to run and set concurrency of command to '1'

input:
  - cmd
  - timeout

tasks:
  task1:
    action: core.local cmd=<% ctx(cmd) %> timeout=<% ctx(timeout) %>
    next:
      - when: <% succeeded() %>
        publish:
          - stdout: <% result().stdout %>
          - stderr: <% result().stderr %>

output:
  - stdout: <% ctx(stdout) %>
  2. Set a concurrency policy on it:
PS C:\Users\me\Development\src\st2-pack> cat .\policies\ccurrency_my_task.yaml
---
name: my_pack.concurrency
description: Limits the concurrent executions of the my_pack.my-task action.
enabled: true
resource_ref: my_pack.my-task
policy_type: action.concurrency
parameters:
    threshold: 1
    action: delay
  3. Create a zookeeper instance and point stackstorm at it as its coordination backend. This can be done, for example, with the following docker-compose file. Place it at the root of a single stackstorm VM, run docker-compose up, and you'll have a zookeeper instance at localhost:2181 that you can use (a quick connectivity check is sketched after these steps):
PS C:\Users\me\Development\src\st2-pack> cat .\docker-compose.yaml
version: '2'

services:
  zookeeper:
    image: 'bitnami/zookeeper:latest'
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
    ports:
      - '2181:2181'

Change /etc/st2/st2.conf to use the zookeeper backend:

[coordination]
url = kazoo://localhost:2181
  4. Set debug = True in st2.conf:
[system]
debug = True
  5. Restart stackstorm to pick up the configuration changes.

  6. Run five executions of the workflow all at once:

$ st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90' & st2 run my_pack.my-task cmd='sleep 90'
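
If the policy misbehaves, it is also worth confirming that each node can actually reach the zookeeper coordination backend from step 3. A minimal connectivity check (a sketch only; it assumes the localhost:2181 instance from the docker-compose file above, and that the kazoo package is importable, which it should be since the coordination url uses the kazoo driver):

from kazoo.client import KazooClient

# Connect to the ZooKeeper instance used as the st2 coordination backend.
# Adjust `hosts` if stackstorm is pointed at a different ZooKeeper address.
zk = KazooClient(hosts='localhost:2181')
zk.start(timeout=10)  # raises if the server cannot be reached in time
print('connected, server version: %s.%s.%s' % zk.server_version())
zk.stop()
zk.close()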

The notable difference between step 6 and my real setup is that, in my environment, the five executions are created by a sensor that emits five triggers within the same few milliseconds, and they are all processed very rapidly. The outcome should hopefully be the same.
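
To approximate that burst a little more closely than the shell one-liner, the five requests can be launched from a small script so they are submitted nearly simultaneously. A rough sketch using only the standard library and the st2 CLI (the action name and cmd parameter are the ones from the steps above):

import subprocess

# Launch five executions of the workflow as close together as possible,
# roughly mimicking a sensor emitting five triggers within milliseconds.
cmd = ['st2', 'run', 'my_pack.my-task', 'cmd=sleep 90']
procs = [subprocess.Popen(cmd) for _ in range(5)]
for proc in procs:
    proc.wait()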

EXPECTED RESULTS

When I print out the executions using st2 execution list, I should see one running execution and four delayed executions.
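
One way to check that tally is to parse the CLI's JSON output and count statuses. A sketch, assuming the -j/--json and --action flags behave on st2 2.10.0 as they do on current versions:

import json
import subprocess
from collections import Counter

# Count executions of the workflow by status; with the policy working,
# this should show one scheduled/running execution and four delayed ones.
out = subprocess.check_output(
    ['st2', 'execution', 'list', '-j', '--action', 'my_pack.my-task'])
print(Counter(execution['status'] for execution in json.loads(out)))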

ACTUAL RESULTS

There were several cases when more than one task was scheduled at the same time.

I have captured log output from my real servers, and I am posting it here:
https://gist.github.com/djhaskin987/de002db96a27991aff8f46c0fd339fd4

The log file is a composite of the /var/log/st2/st2scheduler.log files from both HA servers. The files were cat-ed together, then sorted by line content, which conveniently put the lines in chronological order.

I had to scrub much of the content of those logs, but essentially they look like what you might find by running the steps to reproduce above. I changed the name of the pack in the logs to my_pack, the name of the workflow to my-task, and so on. I also changed the cmd argument of the workflow to sleep 90, since the original command line contains company-specific information; it is a fair stand-in, because the command in question takes roughly 90 seconds to complete. Finally, I replaced the liveaction IDs with first_task_run, second_task_run, etc. so that the runs match the steps to reproduce above and are easy to identify.

Here are the juicy bits that you'll find in that log file.

The second_task_run is the first to ask to be scheduled, and stackstorm
correctly identifies that it ought to be:

2018-12-18 23:18:56,708 140672270229168 DEBUG concurrency [-] There are 0 instances of my_pack.my-task in scheduled or running status. Threshold of my_pack.my_pack.concurrency is not reached. Action execution will be scheduled.

Next, fourth_task_run asks to be scheduled. Here's the error, though: fourth_task_run is erroneously scheduled because the scheduler handling it does not see the scheduled status that was just written to the mongo database for second_task_run. This line should say There are 1 instances of my_pack.my-task...Threshold...is reached, but it doesn't.

2018-12-18 23:18:56,891 140615979540144 DEBUG concurrency [-] There are 0 instances of my_pack.my-task in scheduled or running status. Threshold of my_pack.my_pack.concurrency is not reached. Action execution will be scheduled.

Note that I don't know which server created which log entry (that's my bad, sorry :( ). It could be that the st2scheduler process on one node created the first entry and the process on the other node created the second; I suspect this, but haven't tried to prove it yet.

Note also that the two conflicting log entries posted above occur within 183 milliseconds of each other, i.e. less than a fifth of a second apart.

Finally, first_task_run asks to be scheduled. This time, the database read reflects the values correctly and stackstorm sees that two tasks were already scheduled, and so delays the execution of first_task_run.

2018-12-18 23:18:58,112 140615979540304 DEBUG concurrency [-] There are 2 instances of my_pack.my-task in scheduled or running status. Threshold of my_pack.my_pack.concurrency is reached. Action execution will be delayed.
Likely causes

The problem appears to originate in the _apply_before function of the concurrency policy code:

        # Get the count of scheduled instances of the action.
        scheduled = action_access.LiveAction.count(
            action=target.action, status=action_constants.LIVEACTION_STATUS_SCHEDULED)

        # Get the count of running instances of the action.
        running = action_access.LiveAction.count(
            action=target.action, status=action_constants.LIVEACTION_STATUS_RUNNING)

The above lines query the database for scheduled and running actions, and under concurrent scheduling those calls return stale counts.
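
To make the failure mode concrete, here is a hypothetical sketch (plain Python threads standing in for the scheduler processes on the two nodes, not st2's actual code): each worker counts the scheduled executions before any of them has recorded its own status change, so more than one slips past a threshold of 1. Serializing the read-check-write sequence across schedulers, for example with a lock taken through the coordination backend, closes that window.

import threading
import time

THRESHOLD = 1
scheduled = []   # stands in for the scheduled LiveActions in the database

def try_schedule(name):
    count = len(scheduled)      # analogous to LiveAction.count(...) above
    time.sleep(0.05)            # the window between the read and the write
    if count < THRESHOLD:
        scheduled.append(name)  # execution transitions to "scheduled"
        print('%s scheduled (saw count=%d)' % (name, count))
    else:
        print('%s delayed (saw count=%d)' % (name, count))

threads = [threading.Thread(target=try_schedule, args=('run_%d' % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# This typically prints several "scheduled" lines, mirroring the log output;
# holding a lock shared by all schedulers around the read/check/write makes
# at most one of them schedule at a time.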
