@zzk0 @hjchen2 @mosout Hi, I'm confused about whether the `SetInputTensors` and `Execute` functions serve only one request or all the requests simultaneously. If they serve all the requests, how is parallelism implemented? Could you please give some advice?
```cpp
// collect input: gather the inputs of all requests in this batch
// into contiguous tensors
std::vector<const char*> input_names;
std::vector<oneflow_api::Tensor> input_tensors;
std::vector<BackendMemory*> input_memories;
bool cuda_copy = false;
BackendInputCollector collector(
    requests, request_count, &responses, model_state_->TritonMemoryManager(),
    model_state_->EnablePinnedInput(), CudaStream());
SetInputTensors(
    total_batch_size, requests, request_count, &responses, &collector,
    &input_names, &input_tensors, &input_memories, &cuda_copy);
// wait for any pending async copies before running inference
SynchronizeStream(CudaStream(), cuda_copy);

// execute: run inference once over the collected inputs
uint64_t compute_start_ns = 0;
SET_TIMESTAMP(compute_start_ns);
std::vector<oneflow_api::Tensor> output_tensors;
Execute(&responses, request_count, &input_tensors, &output_tensors);
```
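
For context, here is my current understanding as a minimal, self-contained sketch (the `Request` struct and `feature_dim` are made-up names for illustration, not the backend's real types): the collector concatenates every queued request's input along the batch dimension into one buffer, so a single `Execute` call covers the whole batch.

```cpp
// Hypothetical illustration, not the actual backend code: per-request
// inputs are concatenated along the batch dimension into one contiguous
// buffer, then one "Execute" runs over the whole batch -- which is how
// I read BackendInputCollector + SetInputTensors above.
#include <cstddef>
#include <iostream>
#include <vector>

// One request carries one input tensor of shape [batch, feature_dim].
struct Request {
  std::vector<float> data;
  size_t batch;  // per-request batch size
};

int main() {
  const size_t feature_dim = 4;
  // Three queued requests with per-request batch sizes 1, 2, 1.
  std::vector<Request> requests = {
      {std::vector<float>(1 * feature_dim, 1.0f), 1},
      {std::vector<float>(2 * feature_dim, 2.0f), 2},
      {std::vector<float>(1 * feature_dim, 3.0f), 1},
  };

  // "Collect input": concatenate every request's tensor into one buffer,
  // so total_batch_size covers all requests at once.
  size_t total_batch_size = 0;
  std::vector<float> batched;
  for (const auto& r : requests) {
    batched.insert(batched.end(), r.data.begin(), r.data.end());
    total_batch_size += r.batch;
  }

  // "Execute": a single inference call over the whole batch.
  std::cout << "Running one Execute over total_batch_size="
            << total_batch_size << " (" << batched.size() << " floats)\n";
  return 0;
}
```

If that reading is right, then parallelism would come from Triton running multiple model instances (the `count` in `instance_group`), each invoking this execute path on its own thread, rather than from anything inside `SetInputTensors`/`Execute` themselves. Is that correct?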