@njzjz njzjz commented Jan 27, 2024

Set the default `save_ckpt` to `model.ckpt`, which is used as a prefix. When saving checkpoints, a file such as `model.ckpt-100.pt` is written, and `model.ckpt.pt` is symlinked to it. A file named `checkpoint` is also written to record `model.ckpt-100.pt`.

This matches the behavior of the TF backend, so the following workflow works with the PT backend just as it does with the TF backend:

dp --pt train input.json
# one can cancel the training before it finishes
dp --pt freeze
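
The naming scheme described above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the code from this PR: the function name `save_checkpoint`, its parameters, and the assumption that the `checkpoint` file holds just the latest file name are all hypothetical, and raw bytes stand in for the real model serialization.

```python
import os
from pathlib import Path


def save_checkpoint(data: bytes, prefix: str, step: int, workdir: Path) -> Path:
    """Write `<prefix>-<step>.pt`, point `<prefix>.pt` at it, and record its name.

    Hypothetical helper for illustration; `data` stands in for the
    serialized model state.
    """
    target = workdir / f"{prefix}-{step}.pt"
    target.write_bytes(data)

    latest = workdir / f"{prefix}.pt"
    try:
        # Remove the link left behind by the previous save, if any.
        os.remove(latest)
    except FileNotFoundError:
        # First save in this directory: there is nothing to remove.
        pass
    # Use a relative link target so the directory can be moved as a whole.
    os.symlink(target.name, latest)

    # Assumed format: the `checkpoint` file contains only the latest file name.
    (workdir / "checkpoint").write_text(target.name)
    return target
```

With this layout, a later step can always open `model.ckpt.pt` (or read `checkpoint`) without knowing at which step training stopped. Note that creating symlinks may require extra privileges on Windows.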

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
try:
    # remove old one
    os.remove(new_ff)
except OSError:
    pass

Check notice

Code scanning / CodeQL

Empty except

'except' clause does nothing but pass and there is no explanatory comment.
codecov bot commented Jan 27, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (3e4715f) 74.27% compared to head (968ae48) 74.27%.

| Files | Patch % | Lines |
|---|---|---|
| deepmd/pt/entrypoints/main.py | 0.00% | 3 Missing ⚠️ |
| deepmd/common.py | 93.33% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #3191      +/-   ##
==========================================
- Coverage   74.27%   74.27%   -0.01%     
==========================================
  Files         343      343              
  Lines       31629    31634       +5     
  Branches     1592     1592              
==========================================
+ Hits        23494    23497       +3     
- Misses       7210     7212       +2     
  Partials      925      925              

@wanghan-iapcm wanghan-iapcm merged commit a8168b5 into deepmodeling:devel Jan 28, 2024
@njzjz njzjz mentioned this pull request Apr 2, 2024
thangckt commented May 3, 2024

Hi @njzjz,

May I ask why the PyTorch backend uses two different file extensions, .pt and .pth?

The *.pt files are generated when running

dp --pt train input.json

and the *.pth file when running

dp --pt freeze

Could we use just one of these extensions, for convenience when collecting files in dpgen?

njzjz commented May 3, 2024

No control flow is saved in the checkpoint file.
