[Discussion] It's very easy to meet an error if any one of the trainers retries once #7964

Closed
@gongweibao

Description

We generate the gradient variable name from a counter, so an error easily occurs if any one of the trainers retries even once:

std::string GetGradVarNameForTrainer(const std::string &varname) const {
  if (grads_counter_.find(varname) == grads_counter_.end()) {
    grads_counter_[varname] = 0;
  }
  return string::Sprintf("%s.trainer_%d", varname, grads_counter_[varname]++);
}

Solutions:

  1. Make every ProgramDesc distinguishable from the others (even when they are of the same type):
    • Currently the trainers' ProgramDescs are identical to each other. But we could generate a unique name, set it as an attribute of send_op, and send the gradient under that name. The pserver could then overwrite a gradient whenever it comes from the same ProgramDesc.

Pros:

  • ProgramDescs can be stored in a storage service. The executor does not care what it is and just executes it.
  • It is convenient for supporting fault tolerance.
  • It is convenient for supporting auto-scaling once we implement dynamic ProgramDesc reloading.
  2. Give every trainer a unique name:
    • A trainer can get its unique name from etcd, or
    • we can pass a trainer name through an environment variable, though this looks rather redundant:
      • Since all trainers are in one Kubernetes Job and cannot be distinguished from each other, can we create many Job resources and set in each:

parallelism: 1
completions: 1
...
env:
- name: TRAINER_NAME
  value: <trainer_NAME>
