Specific structure for ensemble model may causes deadlock #7280

Open
ukus04 opened this issue May 28, 2024 · 1 comment

ukus04 commented May 28, 2024

Description
level=error msg="rpc error: code = InvalidArgument desc = in ensemble 'deadlock_model', [request id: <id_unknown>] unexpected deadlock, at least one output is not set while no more ensemble steps can be made" func=blabla/grpc_proxy.server.callTritonProcedure file="/app/grpc_proxy/grpc_proxy_server.go:179

I occasionally see this deadlock error when handling requests, and it occurs much more often when the server is under peak traffic. The log above comes from a client that calls the Triton model server running in another container.

I have included both the problematic and the fixed config.pbtxt below, but the figure I attached is probably more intuitive.
image

Here's the problematic config.pbtxt that causes the deadlock.

platform:  "ensemble"
max_batch_size:  8
input:  {
  name:  "FEATURES"
  data_type:  TYPE_STRING
  dims:  -1
}
input:  {
  name:  "COLUMN_NAMES"
  data_type:  TYPE_STRING
  dims:  -1
}
output:  {
  name:  "SCORES"
  data_type:  TYPE_FP32
  dims:  -1
}
output:  {
  name:  "SCORES_A"
  data_type:  TYPE_FP32
  dims:  -1
}
output:  {
  name:  "SCORES_B"
  data_type:  TYPE_FP32
  dims:  -1
}
output:  {
  name:  "MODEL_VERSION"
  data_type:  TYPE_STRING
  dims:  -1
}
output:  {
  name:  "MODEL_VERSION_A"
  data_type:  TYPE_STRING
  dims:  -1
}
output:  {
  name:  "MODEL_VERSION_B"
  data_type:  TYPE_STRING
  dims:  -1
}
ensemble_scheduling:  {
  step:  {
    model_name:  "A"
    model_version:  1
    input_map:  {
      key:  "COLUMN_NAMES"
      value:  "COLUMN_NAMES"
    }
    input_map:  {
      key:  "FEATURES"
      value:  "FEATURES"
    }
    output_map:  {
      key:  "MODEL_VERSION_A"
      value:  "MODEL_VERSION_A"
    }
    output_map:  {
      key:  "SCORES_A"
      value:  "SCORES_A"
    }
  }
  step:  {
    model_name:  "B"
    model_version:  1
    input_map:  {
      key:  "COLUMN_NAMES"
      value:  "COLUMN_NAMES"
    }
    input_map:  {
      key:  "FEATURES"
      value:  "FEATURES"
    }
    output_map:  {
      key:  "MODEL_VERSION_B"
      value:  "MODEL_VERSION_B"
    }
    output_map:  {
      key:  "SCORES_B"
      value:  "SCORES_B"
    }
  }
  step:  {
    model_name:  "C"
    model_version:  1
    input_map:  {
      key:  "SCORES_A"
      value:  "SCORES_A"
    }
    input_map:  {
      key:  "SCORES_B"
      value:  "SCORES_B"
    }
    output_map:  {
      key:  "MODEL_VERSION"
      value:  "MODEL_VERSION"
    }
    output_map:  {
      key:  "SCORES"
      value:  "SCORES"
    }
  }
}

Here's the fixed config.pbtxt, which doesn't cause the deadlock. My guess is that the main cause is using one model's output both as another model's input and as a final ensemble output. So I changed the config so that the score outputs of A and B no longer go directly to the final output but pass through C instead.

platform:  "ensemble"
max_batch_size:  8
input:  {
  name:  "FEATURES"
  data_type:  TYPE_STRING
  dims:  -1
}
input:  {
  name:  "COLUMN_NAMES"
  data_type:  TYPE_STRING
  dims:  -1
}
output:  {
  name:  "SCORES"
  data_type:  TYPE_FP32
  dims:  -1
}
output:  {
  name:  "SCORES_A"
  data_type:  TYPE_FP32
  dims:  -1
}
output:  {
  name:  "SCORES_B"
  data_type:  TYPE_FP32
  dims:  -1
}
output:  {
  name:  "MODEL_VERSION"
  data_type:  TYPE_STRING
  dims:  -1
}
output:  {
  name:  "MODEL_VERSION_A"
  data_type:  TYPE_STRING
  dims:  -1
}
output:  {
  name:  "MODEL_VERSION_B"
  data_type:  TYPE_STRING
  dims:  -1
}
ensemble_scheduling:  {
  step:  {
    model_name:  "A"
    model_version:  1
    input_map:  {
      key:  "COLUMN_NAMES"
      value:  "COLUMN_NAMES"
    }
    input_map:  {
      key:  "FEATURES"
      value:  "FEATURES"
    }
    output_map:  {
      key:  "MODEL_VERSION_A"
      value:  "MODEL_VERSION_A"
    }
    output_map:  {
      key:  "SCORES_A_TEMP"
      value:  "SCORES_A_TEMP"
    }
  }
  step:  {
    model_name:  "B"
    model_version:  1
    input_map:  {
      key:  "COLUMN_NAMES"
      value:  "COLUMN_NAMES"
    }
    input_map:  {
      key:  "FEATURES"
      value:  "FEATURES"
    }
    output_map:  {
      key:  "MODEL_VERSION_B"
      value:  "MODEL_VERSION_B"
    }
    output_map:  {
      key:  "SCORES_B_TEMP"
      value:  "SCORES_B_TEMP"
    }
  }
  step:  {
    model_name:  "C"
    model_version:  1
    input_map:  {
      key:  "SCORES_A_TEMP"
      value:  "SCORES_A_TEMP"
    }
    input_map:  {
      key:  "SCORES_B_TEMP"
      value:  "SCORES_B_TEMP"
    }
    output_map:  {
      key:  "MODEL_VERSION"
      value:  "MODEL_VERSION"
    }
    output_map:  {
      key:  "SCORES"
      value:  "SCORES"
    }
    output_map:  {
      key:  "SCORES_A"
      value:  "SCORES_A"
    }
    output_map:  {
      key:  "SCORES_B"
      value:  "SCORES_B"
    }
  }
}
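The pattern suspected above (an ensemble tensor that is both consumed by another step and returned as a final ensemble output) can be checked mechanically. The following is a minimal Python sketch, assuming the two configs have been transcribed by hand into plain dicts. The helper `find_dual_use_tensors` and the dict layout are hypothetical and only for illustration; they are not part of Triton.

```python
def find_dual_use_tensors(ensemble_outputs, steps):
    """Return ensemble tensors that are both a final output and a step input."""
    consumed = set()
    for step in steps:
        # input_map values are the ensemble tensor names feeding this step.
        consumed.update(step["input_map"].values())
    return sorted(set(ensemble_outputs) & consumed)


# Ensemble outputs, identical in both configs.
OUTPUTS = [
    "SCORES", "SCORES_A", "SCORES_B",
    "MODEL_VERSION", "MODEL_VERSION_A", "MODEL_VERSION_B",
]

# Hand-transcribed input_map entries of each step (output_map is not needed here).
shared_inputs = {"COLUMN_NAMES": "COLUMN_NAMES", "FEATURES": "FEATURES"}

problematic_steps = [
    {"model_name": "A", "input_map": dict(shared_inputs)},
    {"model_name": "B", "input_map": dict(shared_inputs)},
    {"model_name": "C", "input_map": {"SCORES_A": "SCORES_A", "SCORES_B": "SCORES_B"}},
]

fixed_steps = [
    {"model_name": "A", "input_map": dict(shared_inputs)},
    {"model_name": "B", "input_map": dict(shared_inputs)},
    {"model_name": "C", "input_map": {"SCORES_A_TEMP": "SCORES_A_TEMP",
                                      "SCORES_B_TEMP": "SCORES_B_TEMP"}},
]

dual_use_problematic = find_dual_use_tensors(OUTPUTS, problematic_steps)
dual_use_fixed = find_dual_use_tensors(OUTPUTS, fixed_steps)
print(dual_use_problematic)  # ['SCORES_A', 'SCORES_B']
print(dual_use_fixed)        # []
```

Only the first config has tensors in both roles, which matches the difference between the two configs. This shows where the configs differ; it does not by itself confirm that the dual use is the cause.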

Triton Information
TRITON_VERSION=2.34.0 TRITON_CONTAINER_VERSION=23.05

I'm using a custom-built image that includes only the backends I need.

Expected behavior
I think the first config.pbtxt should work, but it doesn't. I'd like to know what causes the deadlock.
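As a sanity check on that expectation, the first config contains no dependency cycle. The sketch below is a naive readiness simulation in Python: a step runs once all of its input tensors are available, and the run stops when no further step can be made. The `simulate` helper and the step dicts are hypothetical and hand-transcribed from the first config; this is not Triton's actual scheduler and it ignores concurrency and memory handling entirely.

```python
def simulate(ensemble_inputs, ensemble_outputs, steps):
    """Run every step whose inputs are ready; return outputs that never get set."""
    available = set(ensemble_inputs)
    pending = list(steps)
    progressed = True
    while pending and progressed:
        progressed = False
        for step in list(pending):
            if set(step["inputs"]) <= available:
                # All inputs are present, so the step "runs" and sets its outputs.
                available.update(step["outputs"])
                pending.remove(step)
                progressed = True
    return sorted(set(ensemble_outputs) - available)


ENSEMBLE_INPUTS = ["FEATURES", "COLUMN_NAMES"]
ENSEMBLE_OUTPUTS = [
    "SCORES", "SCORES_A", "SCORES_B",
    "MODEL_VERSION", "MODEL_VERSION_A", "MODEL_VERSION_B",
]

# Ensemble tensor names taken from the first (problematic) config.
first_config_steps = [
    {"model": "A", "inputs": ["COLUMN_NAMES", "FEATURES"],
     "outputs": ["MODEL_VERSION_A", "SCORES_A"]},
    {"model": "B", "inputs": ["COLUMN_NAMES", "FEATURES"],
     "outputs": ["MODEL_VERSION_B", "SCORES_B"]},
    {"model": "C", "inputs": ["SCORES_A", "SCORES_B"],
     "outputs": ["MODEL_VERSION", "SCORES"]},
]

unset_outputs = simulate(ENSEMBLE_INPUTS, ENSEMBLE_OUTPUTS, first_config_steps)
print(unset_outputs)  # []
```

Under this simplified model every output of the first config gets set, so the graph itself looks valid. That is consistent with the error showing up only intermittently and mostly under peak traffic, although the sketch cannot say what happens inside the server.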

ukus04 changed the title from "I think specific structure for ensemble model causes deadlock" to "Specific structure for ensemble model may causes deadlock" on May 28, 2024
@wwx007121

I'm encountering the same issue. It seems that system_shard_memory is being allocated using a combination of instance and name. I suspect that this is what leads to the deadlock.
