Design doc: save model in cluster training. #2655
Force-pushed from `3a1caba` to `6c1a314`.
> There are two types of model: dense (e.g., weight for a fully-connected layer) and sparse model (e.g., word embedding). Pservers always jointly have the entire model at any given
Word embedding is not a sparse model. When the input training data is sparse and the user configures the parameter to be sparse, the trainer will detect which part of the parameter should be updated in this batch.
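The update-only-the-touched-rows behavior can be made concrete with a small sketch (plain NumPy; the function name and shapes here are hypothetical, not PaddlePaddle API): when the input is sparse, only the embedding rows appearing in the batch are updated, so the trainer only needs to sync that fraction of the parameter with the pservers.

```python
import numpy as np

def sparse_update(embedding, word_ids, row_grads, lr=0.1):
    """Apply gradients only to the embedding rows seen in this batch.

    embedding: (vocab_size, dim) dense parameter table
    word_ids:  ids appearing in the current (sparse) batch
    row_grads: one gradient row per entry of word_ids
    Returns the rows the trainer would need to sync with the pservers.
    """
    for wid, grad in zip(word_ids, row_grads):
        embedding[wid] -= lr * grad
    return sorted(set(word_ids))

# a vocabulary of 10 words; this batch only touches ids 3 and 7
emb = np.zeros((10, 4))
touched = sparse_update(emb, [3, 7], np.ones((2, 4)))
```

All other rows stay untouched, which is exactly why one training instance does not need the whole parameter.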
@jacquesqiao Thanks for pointing that out! Does it mean it's a sparse model only if the user configures the parameter to be sparse, and the input for the calculation involving the sparse parameter must be sparse?
I think a language model is a sparse model, and word embedding is definitely one kind of sparse model. To be clear, maybe "sparse update" is more accurate than "sparse model". One training instance does not need the whole parameters, which is what makes the update a sparse one.
Thanks! Will change to "sparse update"
-- Helin
I consulted @lcy-seso; she tells me that a "sparse model" is a kind of training method that makes the model itself sparse, which is different from a "sparse update". Language models and word embeddings both have no relation to sparse models. And since we will mostly not use sparse models, we can just use "sparse update" going forward.
Thanks! Will change to "sparse update".
Done.
> The model is the output of the training process. There are two ways from which user can obtain a model:
>
> - Save model triggered by user code: user code asks PaddlePaddle to
Since we are saving the model in the trainer, there is no "asks PaddlePaddle" to do something, which sounds like a remote API call. Maybe change it to "user code can save the model by itself when a batch or a pass finishes."
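The "save when a pass finishes" suggestion maps naturally onto an event-handler hook. A minimal sketch of that shape (the `EndPass` class and `train` loop here are hypothetical stand-ins mirroring the PaddlePaddle v2 trainer style, where the actual save would be something like `parameters.to_tar(f)`):

```python
class EndPass(object):
    """Event fired by the training loop after each pass over the data."""
    def __init__(self, pass_id):
        self.pass_id = pass_id

def train(num_passes, event_handler):
    # stand-in for the real training loop: run passes, fire events
    for pass_id in range(num_passes):
        # ... forward/backward over all batches would happen here ...
        event_handler(EndPass(pass_id))

saved = []

def event_handler(event):
    # user code decides when to save -- here, at the end of every pass
    if isinstance(event, EndPass):
        # a real trainer would open a file here and dump the parameters
        saved.append("params_pass_%d.tar" % event.pass_id)

train(3, event_handler)
```

No remote "ask PaddlePaddle" call is involved; the save is ordinary user code running inside the trainer process.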
Thanks! Will change.
@typhoonzero Actually it depends on whether we implement a method for saving the model, or let users save the model from the parameters themselves. Can you take a look at #2655 (comment)?
> - Convert model from the snapshot: model being converted from pservers' periodic snapshot. In this way, the user can cancel a job at any time, and still have a relatively fresh model (we snapshot around every 5 minutes).
Maybe we should emphasize "snapshot" as "model snapshot" here; otherwise, someone may confuse it with the "checkpoint" if they haven't read the checkpoint section.
Thanks! By "snapshot", I meant "checkpoint". Will change to "checkpoint".
Done.
Force-pushed from `6c1a314` to `5157ba6`.
> dense model, but only have a fraction of the sparse model at any given time.
>
> #### Pservers Saving Model
After a short discussion with @dzhwinter, we think saving a snapshot on the pserver side is needed for recovering the state of pservers. A pserver snapshot should contain not only the parameters but also some state such as optimizer internals.

Saving a snapshot on the pserver can be triggered by a `Save()` RPC call from trainers to pservers. Trainers can save models with `parameter.to_tar()` in `event_handlers`.

The pserver-side "snapshot" will only be used for pserver recovery, while the trainer-side saved model can be used for inference.
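A sketch of what such a pserver snapshot could contain (pickle is used purely for illustration; the real serialization format is an open design choice). The point is that the snapshot bundles the optimizer internals next to the parameters, so a restarted pserver can resume optimization exactly:

```python
import os
import pickle
import tempfile

def save_pserver_snapshot(path, params, optimizer_state):
    # parameters alone are not enough to recover a pserver: optimizer
    # internals (momentum buffers, Adam moments, ...) must survive too
    with open(path, "wb") as f:
        pickle.dump({"params": params, "optimizer": optimizer_state}, f)

def load_pserver_snapshot(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# round-trip a tiny fake shard with its optimizer state
path = os.path.join(tempfile.mkdtemp(), "pserver-0.snap")
save_pserver_snapshot(path, {"w": [0.1, 0.2]}, {"momentum": [0.0, 0.0]})
snapshot = load_pserver_snapshot(path)
```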
> saving a snapshot on the pserver side is needed for recovering the state of pservers. A pserver snapshot should contain not only the parameters but also some state such as optimizer internals.

Agree!

> Saving a snapshot on the pserver can be triggered by a `Save()` RPC call from trainers to pservers.

I think we can add that if we find it necessary later. But for the first version I am inclined to just let the pservers save periodically.

> Trainers can save models with `parameter.to_tar()` in `event_handlers`.

I think the trainer needs a Python function called `save_model(save_dir)`; in that function it will first ask the trainer client whether it is elected to save the model, and save the model only if elected. Otherwise every trainer would try to save the model, putting too much burden on the distributed FS.

> The pserver-side "snapshot" will only be used for pserver recovery, while the trainer-side saved model can be used for inference.

Agree!
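The election-based `save_model(save_dir)` described above could look roughly like this (`TrainerClient` and `request_save_model` are hypothetical stand-ins for whatever client actually runs the election, e.g. against etcd):

```python
class TrainerClient(object):
    # hypothetical stand-in for the client that asks the cluster
    # whether this trainer won the save-model election
    def __init__(self, elected):
        self.elected = elected

    def request_save_model(self):
        return self.elected

def save_model(client, save_dir, parameters, saved_paths):
    # only the elected trainer writes; otherwise every trainer would
    # push a full copy of the model onto the distributed FS
    if not client.request_save_model():
        return False
    saved_paths.append(save_dir)  # stand-in for actually writing parameters
    return True

saved = []
# simulate three trainers, of which only trainer 0 is elected
results = [save_model(TrainerClient(i == 0), "/models/pass-0", {"w": 1.0}, saved)
           for i in range(3)]
```

Exactly one write reaches the filesystem, no matter how many trainers call `save_model`.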
Same ideas as ours. 👍
> at any time, and still have a relatively fresh model (we snapshot around every 5 minutes).
>
> ### Save Model Triggered by User Code
It seems that this section describes the reason why we chose trainer-side model saving, so how about changing the title from "Save Model Triggered by User Code" to "Trainer Saving Model vs. Pservers Saving Model"?
Good idea! Will do.
Done.
> Each trainer will be given the directory to save the model. The elected trainer will save the model to `given-directory/trainerID`. Since the trainerID is unique, this would
If a split-brain happens, two trainers may both save the model, one to `given-directory/00001/pass-0-*` and one to `given-directory/00002/pass-0-*`; which one will we choose to recover from?

How about adding a file lock under the path `given-directory/`, so that a trainer saves the model only if it can acquire the lock?
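A sketch of that file-lock idea using POSIX `flock` (note: advisory locks are not reliable on every distributed filesystem, so this illustrates the mechanism rather than recommending it as-is):

```python
import fcntl
import os
import tempfile

def try_save_with_lock(save_dir, write_fn):
    # take a non-blocking exclusive lock on a lock file under save_dir;
    # if two trainers race (e.g. after a split-brain election), only the
    # one holding the lock actually writes the model
    lock_path = os.path.join(save_dir, ".save.lock")
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return False  # another trainer holds the lock; skip saving
    try:
        write_fn(save_dir)  # stand-in for writing the model files
        return True
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

save_dir = tempfile.mkdtemp()
written = []
ok = try_save_with_lock(save_dir, written.append)
```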
Thanks! Great question.

I think the saved model is not for recovery (the checkpoint is for recovery); it's for the user to use for inference, or as an initial model to train from.

Back to the question: if users want to initialize training from a saved model, they will specify the path themselves. We don't need to decide for them.
OK, I got it.
I love a design doc like this that peels the reasoning open layer by layer, like an onion, and makes everything clear!
Fixes: #2638 #2658