Closed
Description
We generate the gradient variable name from a per-variable counter, so it is easy to hit an error if any one of the trainers retries even once.
std::string GetGradVarNameForTrainer(const std::string &varname) const {
  if (grads_counter_.find(varname) == grads_counter_.end()) {
    grads_counter_[varname] = 0;
  }
  return string::Sprintf("%s.trainer_%d", varname, grads_counter_[varname]++);
}
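To make the failure mode concrete, here is a minimal standalone sketch of counter-based naming. `CounterNamer` and the example variable name `w@GRAD` are hypothetical stand-ins for illustration, and `std::to_string` replaces `string::Sprintf` so the sketch compiles on its own; it is not the actual implementation.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the counter-based naming shown above.
class CounterNamer {
 public:
  // Returns "<varname>.trainer_<n>", where n counts earlier calls for varname.
  std::string Next(const std::string &varname) {
    // operator[] value-initializes a missing counter to 0.
    int id = counter_[varname]++;
    return varname + ".trainer_" + std::to_string(id);
  }

 private:
  std::unordered_map<std::string, int> counter_;
};
```

Assuming two trainers, the first two calls for a variable yield the `trainer_0` and `trainer_1` suffixes. A single retry triggers a third call, which yields `trainer_2`, a name that does not correspond to any trainer.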
Solutions:
- Every ProgramDesc can be distinguished from the others (even if they are the same type):
  - Currently the trainers' ProgramDescs are identical. But maybe we can generate a unique name, set it as an attribute of send_op, and send the gradient under this name, so that the pserver can overwrite a gradient if it comes from the same ProgramDesc.
  - Pros:
    - ProgramDescs can be stored in a storage. The executor does not care what it is and just executes it.
    - It's convenient to support fault-tolerance.
    - It's convenient to support auto-scaling when we implement ProgramDesc dynamic reload.
- Every trainer has a unique name:
  - A trainer can get the unique name from etcd, or
  - Pass a trainer name through an environment variable
    - It looks very redundant:
      - Since all trainers are in one Kubernetes Job and are not distinguished from each other, can we create many Job resources and set the following in each?
parallelism: 1
completions: 1
...
env:
- name: TRAINER_NAME
  value: <trainer_NAME>
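Both proposals come down to naming a gradient by a stable identity rather than by arrival order. The sketch below illustrates that idea under stated assumptions: `TrainerNameFromEnv`, `GradVarNameFor`, and `GradStore` are hypothetical helpers, not existing Paddle APIs, and `GradStore` only imitates a pserver keeping one slot per gradient name.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Reads the trainer name from the TRAINER_NAME environment variable and
// returns the fallback when the variable is unset or empty.
std::string TrainerNameFromEnv(const std::string &fallback) {
  const char *name = std::getenv("TRAINER_NAME");
  return (name != nullptr && *name != '\0') ? std::string(name) : fallback;
}

// Builds the gradient name from a stable trainer identity, not a counter.
std::string GradVarNameFor(const std::string &varname,
                           const std::string &trainer_name) {
  return varname + "." + trainer_name;
}

// Hypothetical pserver-side store: one slot per gradient name, so a retried
// send overwrites the earlier value instead of creating a new entry.
class GradStore {
 public:
  void Put(const std::string &grad_name, std::vector<float> value) {
    grads_[grad_name] = std::move(value);
  }
  const std::vector<float> &Get(const std::string &grad_name) const {
    return grads_.at(grad_name);
  }
  std::size_t Size() const { return grads_.size(); }

 private:
  std::map<std::string, std::vector<float>> grads_;
};
```

With this naming, a trainer that retries sends the same name twice and the store still holds a single entry for it, which is the overwrite behavior the first proposal asks of the pserver.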