[Discussion] It's very easy to meet an error if any one of the trainers retries once #7964

Closed
@gongweibao

Description

We generate the gradient variable name from a counter, so an error easily occurs if any one of the trainers retries even once:

std::string GetGradVarNameForTrainer(const std::string &varname) const {
  if (grads_counter_.find(varname) == grads_counter_.end()) {
    grads_counter_[varname] = 0;
  }
  return string::Sprintf("%s.trainer_%d", varname, grads_counter_[varname]++);
}

Solutions:

  1. Make every ProgramDesc distinguishable from the others (even when they are of the same type):
    • Currently the trainers' ProgramDescs are identical to each other. But we could generate a unique name, set it as an attribute of send_op, and send the gradient under that name. The pserver could then overwrite a gradient whenever it comes from the same ProgramDesc.

Pros:

  • ProgramDescs can be stored in a storage service. The executor does not care what it is and just executes it.
  • It is convenient for supporting fault tolerance.
  • It is convenient for supporting auto-scaling once we implement dynamic ProgramDesc reloading.
  2. Give every trainer a unique name:
    • A trainer can get its unique name from etcd, or
    • we can pass a trainer name through an environment variable, though this looks rather redundant:
      • Since all trainers are in one Kubernetes Job and cannot be distinguished from each other, can we create many Job resources and set in each:

parallelism: 1
completions: 1
...
env:
- name: TRAINER_NAME
  value: <trainer_NAME>
