Closed
Description
Is `ParallelConfig.pipeline_parallel_size` intended for multi-GPU setups, and can it be set to the number of GPU cards? Does it relate to processing multiple prompts and generating multiple results in parallel? For example, with 2 GPU cards and 7 requests, will the 7 requests be distributed across the 2 GPUs simultaneously, and how is that allocation done?

Also, what do the `max_num_batched_tokens` and `max_num_seqs` parameters in `SchedulerConfig` represent? How should they be set to preserve longer context?
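For the second question, the two limits can be pictured as per-step budgets: each scheduling step admits waiting requests into a batch until either the sequence count (`max_num_seqs`) or the token count (`max_num_batched_tokens`) budget is exhausted. The sketch below is an illustrative toy model of that admission logic, not vLLM's actual scheduler code; the function name and the request lengths are made up for the example.

```python
# Toy model of continuous-batching admission under the two SchedulerConfig
# budgets. NOT vLLM's implementation -- just the budgeting idea.

def schedule_step(waiting, max_num_seqs, max_num_batched_tokens):
    """Admit waiting requests (given as prompt-token counts) into one batch.

    Stops as soon as either budget would be exceeded; returns the batch
    and the requests deferred to a later step.
    """
    batch, tokens = [], 0
    for req_tokens in waiting:
        if len(batch) >= max_num_seqs:
            break  # sequence budget exhausted
        if tokens + req_tokens > max_num_batched_tokens:
            break  # token budget exhausted
        batch.append(req_tokens)
        tokens += req_tokens
    return batch, waiting[len(batch):]

# 7 requests with varying prompt lengths; budget of 4 seqs / 2048 tokens.
waiting = [512, 256, 1024, 128, 768, 300, 64]
batch, remaining = schedule_step(waiting, max_num_seqs=4,
                                 max_num_batched_tokens=2048)
print(batch)      # -> [512, 256, 1024, 128]
print(remaining)  # -> [768, 300, 64] (deferred to later steps)
```

Under this picture, preserving longer context is mainly a matter of raising the token budget (and the model's context length limit) so that long prompts still fit in a step, at the cost of admitting fewer sequences at once.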