training with multiple machines #3966
Conversation
OK, I assume this is OK to merge? Any objections?
@@ -28,7 +29,11 @@ def main():
        logging.warning('No GPU detected! Use CPU for inference.')
        device = torch.device('cpu')
    else:
        device = torch.device('cuda', args.device_id)
        devices = allocate_gpu_devices(1)
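For context, here is a minimal sketch of what an `allocate_gpu_devices`-style helper might look like (a hypothetical implementation; the actual function in this PR may differ). Under `Exclusive_Process` compute mode, claiming a busy device fails, so the helper can simply try each candidate device until it has acquired the requested number:

```python
def allocate_gpu_devices(num_wanted, num_visible=8, try_acquire=None):
    """Return a list of up to `num_wanted` usable GPU ids.

    `try_acquire(device_id)` should return True if the device could be
    claimed, e.g. by allocating a small tensor on it; under
    Exclusive_Process mode this fails for devices already in use.
    The parameter is injectable so the logic can be tested without GPUs.
    """
    if try_acquire is None:
        import torch  # assumed available in the training environment

        def try_acquire(device_id):
            try:
                # Allocating a tensor claims the device; in
                # Exclusive_Process mode this raises if it is busy.
                torch.zeros(1, device=torch.device('cuda', device_id))
                return True
            except RuntimeError:
                return False

    allocated = []
    for device_id in range(num_visible):
        if len(allocated) == num_wanted:
            break
        if try_acquire(device_id):
            allocated.append(device_id)
    return allocated
```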
Should we follow the Kaldi-style SelectGpu() and select the GPU that has the largest available memory?
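A "largest free memory" selector of the kind mentioned above could be sketched in Python by parsing `nvidia-smi` output (a hypothetical helper for illustration; Kaldi's own SelectGpu() is implemented in C++ and works differently):

```python
import subprocess


def select_gpu(smi_output=None):
    """Return the id of the GPU with the largest free memory.

    `smi_output` is the text produced by
    `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`
    (one integer, in MiB, per line). If None, nvidia-smi is invoked
    directly; passing it explicitly allows testing without a GPU.
    """
    if smi_output is None:
        smi_output = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=memory.free',
             '--format=csv,noheader,nounits'],
            encoding='utf-8')
    free_mb = [int(line.strip())
               for line in smi_output.splitlines() if line.strip()]
    # argmax over free memory; ties resolve to the lowest device id
    return max(range(len(free_mb)), key=lambda i: free_mb[i])
```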
I prefer to do this with --q options in queue.pl, considering that the GPUs on a single machine nowadays commonly have the same type (and memory), and we assume they are in Exclusive_Process mode.
@@ -28,7 +29,11 @@ def main():
        logging.warning('No GPU detected! Use CPU for inference.')
        device = torch.device('cpu')
    else:
        device = torch.device('cuda', args.device_id)
        devices = allocate_gpu_devices(1)
What if the user provides device_id at the command line?
Let me do this in another PR. I think what we need is device_ids instead of device_id. Actually, I haven't thought this through very clearly yet, because:
- Multiple GPU ids are required when training with multiple GPUs on a single machine; that's why we need device_ids.
- Assigning a GPU id when training with multiple machines is relatively complex for users.

@danpovey any comments about this?
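One possible shape for the device_ids option discussed above, with automatic allocation as the fallback when the flag is omitted (a hypothetical sketch; the actual flag name and fallback behaviour would be decided in the follow-up PR):

```python
import argparse


def parse_device_ids(argv=None):
    """Parse an optional --device-ids flag, e.g. --device-ids 0,1,3.

    Returns a list of ints, or None to mean "let the framework pick
    free GPUs automatically" (e.g. via an allocate_gpu_devices helper).
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--device-ids', type=str, default=None,
        help='comma-separated GPU ids to use; omit to auto-allocate')
    args = parser.parse_args(argv)
    if args.device_ids is None:
        return None
    return [int(x) for x in args.device_ids.split(',')]
```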
Force-pushed from 99ee14b to a0343b6
It does not help. Thanks, updated the text. We can't say
I just feel that it's unnatural to say
OK, updated.
Merged, thanks!!
I have implemented online-ivector in the PyTorch model, but I get no gain, and in fact worse results, when using online-ivector. @qindazhu @csukuangfj please review this code.
This is the result.
Could you please open a pull request so that we can view the whole picture?
OK |
run_cleanup_segmentation.sh
as it does not make the result better. Result