This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Subworkflow completes but main workflow is not starting next task #3089

Closed
akash0996 opened this issue Jul 5, 2022 · 25 comments

@akash0996

Screen Shot 2022-07-06 at 12 55 35 AM

Screen Shot 2022-07-06 at 12 56 14 AM

@jxu-nflx
Contributor

jxu-nflx commented Jul 5, 2022

@akash0996 It would be helpful if you could provide the workflow JSON and also check the Conductor server logs for any errors. I assume the next task is a decision task, based on the image provided?

@aravindanr
Collaborator

Please also include which persistence, queue, and lock implementations are used.

@akash0996
Author

@jxu-nflx I have attached the workflow definitions below.

Main WF

{
    "createTime": 1652893391845,
    "name": "BP_URL_QA",
    "description": "test",
    "version": 10,
    "tasks": [
        {
            "name": "Sub_Workflow_URLScan",
            "taskReferenceName": "sub_urlscan",
            "inputParameters": {
                "url": "${workflow.input.cs4}"
            },
            "type": "SUB_WORKFLOW",
            "decisionCases": {},
            "defaultCase": [],
            "forkTasks": [],
            "startDelay": 0,
            "subWorkflowParam": {
                "name": "Submit_to_URLScan",
                "version": 8
            },
            "joinOn": [],
            "optional": false,
            "defaultExclusiveJoinTask": [],
            "asyncComplete": false,
            "loopOver": []
        },
        {
            "name": "Failed_QA",
            "taskReferenceName": "failed_qa",
            "inputParameters": {
                "urlscan_malicious": "${sub_urlscan.output.urlscan_malicious}"
            },
            "type": "DECISION",
            "caseValueParam": "urlscan_malicious",
            "decisionCases": {
                "true": [
                    {
                        "name": "Failed_QA_Targeting_Chase",
                        "taskReferenceName": "failed_qa_targeting_chase",
                        "inputParameters": {
                            "urlscan_brand": "${sub_urlscan.output.urlscan_brands}"
                        },
                        "type": "DECISION",
                        "caseValueParam": "urlscan_brand",
                        "decisionCases": {
                            "chase": [
                                {
                                    "name": "Submit_Failed_QA_To_HF",
                                    "taskReferenceName": "qa_failure_to_HF",
                                    "inputParameters": {
                                        "cs1": "${workflow.input.cs1}",
                                        "cs2": "${workflow.input.cs2}",
                                        "cs3": "${workflow.input.cs3}",
                                        "cs4": "${workflow.input.cs4}",
                                        "alert_time": "${workflow.input.alert_time}",
                                        "description": "${workflow.input.QA_Reason} UUID: ${workflow.input.UUID}. ${sub_urlscan.output.urlscan_report}"
                                    },
                                    "type": "SUB_WORKFLOW",
                                    "decisionCases": {},
                                    "defaultCase": [],
                                    "forkTasks": [],
                                    "startDelay": 0,
                                    "subWorkflowParam": {
                                        "name": "Submit_to_PL_HF",
                                        "version": 11
                                    },
                                    "joinOn": [],
                                    "optional": false,
                                    "defaultExclusiveJoinTask": [],
                                    "asyncComplete": false,
                                    "loopOver": []
                                }
                            ]
                        },
                        "defaultCase": [],
                        "forkTasks": [],
                        "startDelay": 0,
                        "joinOn": [],
                        "optional": false,
                        "defaultExclusiveJoinTask": [],
                        "asyncComplete": false,
                        "loopOver": []
                    }
                ]
            },
            "defaultCase": [],
            "forkTasks": [],
            "startDelay": 0,
            "joinOn": [],
            "optional": false,
            "defaultExclusiveJoinTask": [],
            "asyncComplete": false,
            "loopOver": []
        }
    ],
    "inputParameters": [],
    "outputParameters": {},
    "schemaVersion": 2,
    "restartable": true,
    "workflowStatusListenerEnabled": false,
    "ownerEmail": "redacted@test.com",
    "timeoutPolicy": "ALERT_ONLY",
    "timeoutSeconds": 0,
    "variables": {},
    "inputTemplate": {}
}

Sub WF

{
  "updateTime": 1656600859753,
  "name": "Submit_to_URLScan",
  "description": "test",
  "version": 8,
  "tasks": [
    {
      "name": "Submit_to_URLScan",
      "taskReferenceName": "urlscan_submit",
      "inputParameters": {
        "submission_url": "${workflow.input.url}"
      },
      "type": "SIMPLE",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "createTime": 1652893521877,
        "createdBy": "",
        "name": "Submit_to_URLScan",
        "retryCount": 3,
        "timeoutSeconds": 1200,
        "inputKeys": [
          "submission_url"
        ],
        "outputKeys": [
          "status_code",
          "response"
        ],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 10,
        "responseTimeoutSeconds": 1000,
        "concurrentExecLimit": 100,
        "inputTemplate": {},
        "rateLimitPerFrequency": 50,
        "rateLimitFrequencyInSeconds": 1,
        "ownerEmail": "redacted@test.com",
        "pollTimeoutSeconds": 3600
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    },
    {
      "name": "URLScan_Response",
      "taskReferenceName": "urlscan_response",
      "inputParameters": {
        "submission_url": "${urlscan_submit.output.response}"
      },
      "type": "SIMPLE",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "createTime": 1652893496877,
        "createdBy": "",
        "name": "URLScan_Response",
        "retryCount": 3,
        "timeoutSeconds": 1200,
        "inputKeys": [
          "urlscan_url"
        ],
        "outputKeys": [
          "urlscan_analysis",
          "urlscan_malicious",
          "urlscan_brands",
          "urlscan_report",
          "source",
          "index",
          "host",
          "event",
          "sourcetype"
        ],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 10,
        "responseTimeoutSeconds": 1000,
        "concurrentExecLimit": 100,
        "inputTemplate": {},
        "rateLimitPerFrequency": 50,
        "rateLimitFrequencyInSeconds": 1,
        "ownerEmail": "redacted@test.com",
        "pollTimeoutSeconds": 3600
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    },
    {
      "name": "splunk_post_data",
      "taskReferenceName": "splunk_post_data_1",
      "inputParameters": {
        "source": "${urlscan_response.output.source}",
        "index": "${urlscan_response.output.index}",
        "host": "${urlscan_response.output.host}",
        "event": "${urlscan_response.output.event}",
        "sourcetype": "${urlscan_response.output.sourcetype}"
      },
      "type": "SIMPLE",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "createTime": 1648148740831,
        "createdBy": "",
        "name": "splunk_post_data",
        "description": "Post data to Splunk",
        "retryCount": 3,
        "timeoutSeconds": 1200,
        "inputKeys": [
          "data",
          "index",
          "host",
          "source_type",
          "source"
        ],
        "outputKeys": [],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 600,
        "responseTimeoutSeconds": 1200,
        "inputTemplate": {},
        "rateLimitPerFrequency": 0,
        "rateLimitFrequencyInSeconds": 1,
        "ownerEmail": "my_owner_email@test.com",
        "pollTimeoutSeconds": 3600
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    }
  ],
  "inputParameters": [],
  "outputParameters": {
    "urlscan_malicious": "${urlscan_response.output.urlscan_malicious}",
    "urlscan_brands": "${urlscan_response.output.urlscan_brands}",
    "urlscan_report": "${urlscan_response.output.urlscan_report}"
  },
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": false,
  "ownerEmail": "redacted@test.com",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {},
  "inputTemplate": {}
}
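
For readers following the two definitions above: the parent's `Failed_QA` DECISION task branches on `urlscan_malicious`, which comes from the sub workflow's `outputParameters`. The sketch below is a simplified model of that branch selection, not Conductor's actual implementation. The helper names `case_key` and `pick_branch` are hypothetical, and the task dictionary is a trimmed-down copy of the definition above.

```python
def case_key(value):
    """Render a resolved input the way it would appear as a JSON case key."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def pick_branch(decision_task, resolved_inputs):
    """Return the task list of the matching case, or defaultCase if none matches."""
    value = resolved_inputs.get(decision_task["caseValueParam"])
    cases = decision_task.get("decisionCases", {})
    return cases.get(case_key(value), decision_task.get("defaultCase", []))


# Trimmed-down copy of the Failed_QA task from the main workflow above.
failed_qa = {
    "caseValueParam": "urlscan_malicious",
    "decisionCases": {"true": ["failed_qa_targeting_chase"]},
    "defaultCase": [],
}

malicious_branch = pick_branch(failed_qa, {"urlscan_malicious": True})
clean_branch = pick_branch(failed_qa, {"urlscan_malicious": False})
missing_branch = pick_branch(failed_qa, {})  # sub workflow output never arrived
```

In this toy model, a missing `urlscan_malicious` value falls through to the empty `defaultCase`, so whether the sub workflow's output reaches the parent task decides which path the decision takes.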

@hebrd

hebrd commented Jul 14, 2022

The same issue has happened several times, usually after retrying a failed subworkflow.
Context:

  • persistence: mysql
  • queue: mysql
  • lock: zookeeper

The workflow keeps going after restarting the Conductor server.

@hebrd

hebrd commented Jul 18, 2022

@jxu-nflx Can you reproduce this issue, or do you have any suggestions for avoiding this serious problem? I do not have enough time to read through all the related code.

@akash0996 I am having trouble reproducing this issue. Can you reproduce it? Any information would help me reproduce it and try to fix the problem.

@manan164
Contributor

Hi @BrandonDotLin, can you please check whether the workflow sweeper is running? Ideally, async system tasks should be polled every 60 seconds.

@hebrd

hebrd commented Jul 19, 2022

> Hi @BrandonDotLin , Can you please check if workflow sweeper is running or not? Ideally async system task should be polled every 60 seconds.

Yes, with the following config:

conductor.app.sweeperThreadCount=20
conductor.app.seepFrequency=10

@hebrd

hebrd commented Aug 11, 2022

Any updates on this issue? @aravindanr

@apanicker-nflx
Collaborator

We are looking into this issue and will release a fix soon.

@hebrd

hebrd commented Aug 16, 2022

Great. Have you been able to reproduce this issue so far? If so, would you please share how? When this happened, the deciderQueue message was popped but nothing happened with the workflow. Everything worked well after I changed the popped flag to 0 (MySQL persistence).
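
The workaround described above (resetting the popped flag of the stuck decider-queue message) can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite table as a stand-in for MySQL, and the table name `queue_message`, the columns `queue_name`, `message_id`, and `popped`, the queue name `_deciderQueue`, and the helper `reset_popped` are all assumptions that should be checked against the actual schema before touching a real database.

```python
import sqlite3

# In-memory stand-in for the MySQL queue table (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue_message ("
    " queue_name TEXT, message_id TEXT, popped INTEGER DEFAULT 0)"
)
# A decider-queue message that was popped but never acted on.
conn.execute("INSERT INTO queue_message VALUES ('_deciderQueue', 'wf-123', 1)")


def reset_popped(connection, queue_name, workflow_id):
    """Mark a popped message as unpopped so it can be picked up again.

    Returns the number of rows that were changed.
    """
    cur = connection.execute(
        "UPDATE queue_message SET popped = 0 "
        "WHERE queue_name = ? AND message_id = ? AND popped = 1",
        (queue_name, workflow_id),
    )
    connection.commit()
    return cur.rowcount


updated = reset_popped(conn, "_deciderQueue", "wf-123")
popped_now = conn.execute(
    "SELECT popped FROM queue_message WHERE message_id = 'wf-123'"
).fetchone()[0]
```

Restricting the update to one queue and one workflow ID keeps the change narrow. This mirrors the manual fix reported here and is not a fix for the underlying cause.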

@saradhi-atluri

Do we have any update on this issue?

@aravindanr
Collaborator

> When this happened, the deciderQueue message was popped but nothing happened with the workflow, everything worked well after I change the popped flag to 0 (mysql persistence)

This seems like an issue with conductor-mysql-persistence module (maintained by conductor-community).

There was a similar issue (#3183) with sub_workflow task for which a fix was released in v3.10.7.

@hebrd

hebrd commented Aug 23, 2022

> When this happened, the deciderQueue message was popped but nothing happened with the workflow, everything worked well after I change the popped flag to 0 (mysql persistence)
>
> This seems like an issue with conductor-mysql-persistence module (maintained by conductor-community).
>
> There was a similar issue (#3183) with sub_workflow task for which a fix was released in v3.10.7.

This issue (the next task not starting) can happen even when the task that just finished is not a sub-workflow.

@azriel46d

After upgrading to v3.10.7 with Postgres persistence 3.10.5, I have executions hanging in the same manner as described above, after switch nodes.
The switch nodes evaluate the result correctly. The defaultCase is chosen (which in this case has zero nodes), and the next node should be picked up, yet it is never executed. So far I have managed to replicate this on two different flows.

image
image

Server logs do not yield anything particular.
Changing the polling to false on queue-messages in Postgres does not trigger the workflow to continue.

Is there a way to 'force' the next task?

@jxu-nflx
Contributor

> After upgrading to v3.10.7 with Postgres persistence 3.10.5 I have execution hanging in the same manner as described above after switch nodes. The switch nodes evaluate the result correctly. The defaultCase is chosen (which in this case has zero nodes) and the next node should be picked up. Yet this is never executed. So far managed to replicate on two different flows.
>
> image image
>
> Server logs do not yield anything particular. Changing the polling to false on queue-messages in Postgres , does not trigger the workflow to continue.
>
> Is there a way where the next task can be 'forced'?

This issue is fixed here: #3197

@hebrd

hebrd commented Sep 2, 2022

Hi @apanicker-nflx, sorry to mention you again. Would you please give some instructions for reproducing the issue so I can try to fix it myself? We need to fix this problem ASAP.

@apanicker-nflx
Collaborator

@BrandonDotLin We came across this issue when running regression tests at scale. The likely cause was a race condition between updating the subworkflow task in the main workflow and the subworkflow itself completing. We have since fixed this race condition, with the latest fix in v3.11.3.

@hebrd

hebrd commented Sep 13, 2022

> @BrandonDotLin We came across this issue when running regression tests at scale. The likely cause was attributed to a race condition between updating subworkflow task in main workflow and the subworkflow itself completing. We have since fixed this race condition with the latest fix in v3.11.3

Would you please advise which part of the changes in v3.11.3 is related to the race condition fix? They all seem to be about the output of the subworkflow.

@apanicker-nflx
Collaborator

Part of the changes were made in v3.10.7 and part are in v3.11.3.
Basically, changes were made to the subworkflow execution, and to repair the subworkflow task in the parent after the fact.

@autodidactic

We have created a simple workflow with version 3.11.3. It turns out the output of the sub workflow is not being fed back to the main flow. More specifically, the join does not resume after the previous fork tasks are completed.

Here is the JSON:

{
  "taskType": "FORK",
  "status": "COMPLETED",
  "inputData": {},
  "referenceTaskName": "get_details_fork_ref",
  "retryCount": 0,
  "seq": 1,
  "pollCount": 0,
  "taskDefName": "FORK",
  "scheduledTime": 1666034254598,
  "startTime": 1666034253907,
  "endTime": 1666034253907,
  "updateTime": 1666034255716,
  "startDelayInSeconds": 0,
  "retried": false,
  "executed": true,
  "callbackFromWorker": true,
  "responseTimeoutSeconds": 0,
  "workflowInstanceId": "4417b426-9857-4a5b-b75a-eba8e784a9f0",
  "workflowType": "Get_Details_By_SID_Workflow",
  "taskId": "1d1fa9b1-f7cc-40e8-be17-8d1fc033a469",
  "callbackAfterSeconds": 0,
  "outputData": {},
  "workflowTask": {
    "name": "get_details_fork",
    "taskReferenceName": "get_details_fork_ref",
    "inputParameters": {},
    "type": "FORK_JOIN",
    "decisionCases": {},
    "defaultCase": [],
    "forkTasks": [
      [
        {
          "name": "Get_Details_By_SID_SUB01",
          "taskReferenceName": "task_by_sid_ref_sub01",
          "inputParameters": {
            "sid": "${workflow.input.sid}"
          },
          "type": "SUB_WORKFLOW",
          "decisionCases": {},
          "defaultCase": [],
          "forkTasks": [],
          "startDelay": 0,
          "subWorkflowParam": {
            "name": "get_details_sid_sub",
            "version": 1
          },
          "joinOn": [],
          "optional": false,
          "defaultExclusiveJoinTask": [],
          "asyncComplete": false,
          "loopOver": []
        }
      ],
      [
        {
          "name": "Get_Details_By_SID_SUB02",
          "taskReferenceName": "task_by_sid_ref_sub02",
          "inputParameters": {
            "sid": "${workflow.input.sid}"
          },
          "type": "SUB_WORKFLOW",
          "decisionCases": {},
          "defaultCase": [],
          "forkTasks": [],
          "startDelay": 0,
          "subWorkflowParam": {
            "name": "get_details_sid_sub",
            "version": 1
          },
          "joinOn": [],
          "optional": false,
          "defaultExclusiveJoinTask": [],
          "asyncComplete": false,
          "loopOver": []
        }
      ]
    ],
    "startDelay": 0,
    "joinOn": [],
    "optional": false,
    "defaultExclusiveJoinTask": [],
    "asyncComplete": false,
    "loopOver": []
  },
  "rateLimitPerFrequency": 0,
  "rateLimitFrequencyInSeconds": 0,
  "workflowPriority": 0,
  "iteration": 0,
  "subworkflowChanged": false,
  "taskDefinition": null,
  "queueWaitTime": -691,
  "loopOverTask": false
}

Maybe you can let us know what configuration we are missing.

@autodidactic

@apanicker-nflx - can you paste an example json of the problem in question and how this version 3.11.3 fixed the race condition?

thank you

@apanicker-nflx
Collaborator

> more specifically the join does not resume after prev fork tasks are completed.

A JOIN task does not complete immediately; rather, it is evaluated asynchronously by the workflow reconciler.

> can you paste an example json of the problem in question

Unfortunately, I do not have one.

> how this version 3.11.3

This version along with some other changes in an earlier version fixed the consequences of race conditions. In this specific case, the workflow repair service was tasked with identifying and fixing cases where the subworkflow status/output was not reflected correctly in the parent workflow's subworkflow task.
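
As a rough illustration of the kind of repair described above (not the actual Conductor repair service), the sketch below copies a finished sub workflow's status and output onto a parent task that still looks unfinished. The function name `repair_subworkflow_task`, the dictionary shapes, and the simplified status mapping are assumptions made for this example.

```python
# Simplified mapping from terminal sub workflow status to parent task status
# (assumption: only these three states are modelled here).
TERMINAL_STATUS_MAP = {
    "COMPLETED": "COMPLETED",
    "FAILED": "FAILED",
    "TIMED_OUT": "TIMED_OUT",
}


def repair_subworkflow_task(parent_task, sub_workflow):
    """Sync a stale parent SUB_WORKFLOW task with its finished sub workflow.

    Returns True if the parent task was changed, False otherwise.
    """
    new_status = TERMINAL_STATUS_MAP.get(sub_workflow["status"])
    if new_status is None:
        return False  # sub workflow has not reached a modelled terminal state
    if parent_task["status"] in TERMINAL_STATUS_MAP.values():
        return False  # parent task already reflects a terminal state
    parent_task["status"] = new_status
    parent_task["outputData"] = dict(sub_workflow.get("output", {}))
    return True


# Parent task still IN_PROGRESS although the sub workflow has completed.
stale_task = {"referenceTaskName": "sub_urlscan", "status": "IN_PROGRESS", "outputData": {}}
finished_sub = {"status": "COMPLETED", "output": {"urlscan_malicious": True}}

changed = repair_subworkflow_task(stale_task, finished_sub)
changed_again = repair_subworkflow_task(stale_task, finished_sub)
```

The second call changes nothing, which is the property a periodic repair pass needs: it can run repeatedly without disturbing tasks that are already consistent.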

@autodidactic

"joinOn": [
  "task_by_sid_ref_sub01",
  "task_by_sid_ref_sub02"
]

In my case the join is contingent on task 0 and task 1. Does that mean that if both tasks are complete, the join evaluates to "TRUE", and that if one of them is "in progress" or "failed", the join always evaluates to "FALSE"? Is there a parameter for the workflow reconciler to force it to check?
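
One way to picture the join semantics being asked about is the toy model below. It is a sketch of the general idea only and is not confirmed by the maintainers in this thread. It assumes that a join completes once every `joinOn` reference is COMPLETED, fails if any of them ended in a failure state, and otherwise stays in progress. The function `evaluate_join` and the `FAILURE_STATES` set are hypothetical, and details such as optional tasks are ignored.

```python
# Assumed set of task states treated as failures in this toy model.
FAILURE_STATES = {"FAILED", "FAILED_WITH_TERMINAL_ERROR", "TIMED_OUT", "CANCELED"}


def evaluate_join(join_on, status_by_ref):
    """Return the join's status given the statuses of the tasks it waits on."""
    statuses = [status_by_ref.get(ref) for ref in join_on]
    if any(status in FAILURE_STATES for status in statuses):
        return "FAILED"
    if all(status == "COMPLETED" for status in statuses):
        return "COMPLETED"
    return "IN_PROGRESS"  # at least one task is missing or still running


join_on = ["task_by_sid_ref_sub01", "task_by_sid_ref_sub02"]

both_done = evaluate_join(
    join_on,
    {"task_by_sid_ref_sub01": "COMPLETED", "task_by_sid_ref_sub02": "COMPLETED"},
)
one_running = evaluate_join(
    join_on,
    {"task_by_sid_ref_sub01": "COMPLETED", "task_by_sid_ref_sub02": "IN_PROGRESS"},
)
one_failed = evaluate_join(
    join_on,
    {"task_by_sid_ref_sub01": "FAILED", "task_by_sid_ref_sub02": "COMPLETED"},
)
```

Under this model, a parent SUB_WORKFLOW task that never gets updated from IN_PROGRESS keeps the join waiting indefinitely, which would match the symptom reported in this comment.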

@github-actions
Contributor

github-actions bot commented Dec 3, 2022

This issue is stale, because it has been open for 45 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 3, 2022
@github-actions
Contributor

This issue was closed, because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Dec 10, 2022