[BUG] Different results from different versions when training DPLR model #2862

@Liu-RX

Bug summary

I was trying out the DPLR training example in examples/water/dplr/train when I ran into the following problem. I followed the instructions in the online doc, i.e., first train the Deep Wannier network with the dp train dw.json && dp freeze -o dw.pb command, then train the energy network with the dp train ener.json command.

The training of the Deep Wannier network works well. However, when training the energy model with the ener.json file, the force RMSE keeps rising, as the following lcurve.out shows.

#  step      rmse_trn    rmse_e_trn    rmse_f_trn         lr
      0      2.58e+01      1.53e-01      8.15e-01    1.0e-03
    100      1.49e+01      3.80e-02      6.27e-01    5.6e-04
    200      2.28e+01      5.10e-02      1.28e+00    3.2e-04
    300      2.99e+01      1.12e-01      2.23e+00    1.8e-04
    400      2.66e+01      6.64e-02      2.65e+00    1.0e-04
    500      2.52e+01      6.81e-04      3.34e+00    5.6e-05
    600      1.69e+01      9.14e-03      2.95e+00    3.2e-05
    700      2.06e+01      4.41e-02      4.76e+00    1.8e-05
    800      1.39e+01      3.82e-02      4.20e+00    1.0e-05
    900      1.20e+01      5.22e-02      4.64e+00    5.6e-06
   1000      1.61e+01      4.17e-02      7.86e+00    3.2e-06
   1100      1.48e+01      4.30e-02      8.87e+00    1.8e-06
   1200      8.84e+00      3.74e-02      6.23e+00    1.0e-06
   1300      9.89e+00      6.62e-03      7.91e+00    5.6e-07
   1400      1.32e+01      7.61e-02      1.14e+01    3.2e-07
   1500      1.00e+01      6.95e-03      9.21e+00    1.8e-07
   1600      1.10e+01      5.02e-02      1.05e+01    1.0e-07
   1700      9.08e+00      4.07e-03      8.83e+00    5.6e-08
   1800      1.48e+01      1.25e-01      1.44e+01    3.2e-08
   1900      1.46e+01      1.59e-01      1.41e+01    1.8e-08
   2000      1.28e+01      9.54e-02      1.26e+01    1.0e-08

I also tried the same example on 2.2.0, 2.1.5, and 2.1.0. v2.2.0 gives the same result as above. v2.1.5 and v2.1.0 give the following result, which looks more reasonable:

#  step      rmse_trn    rmse_e_trn    rmse_f_trn         lr
      0      2.58e+01      1.53e-01      8.15e-01    1.0e-03
    100      1.39e+01      1.48e-01      5.79e-01    5.6e-04
    200      8.65e+00      6.18e-02      4.82e-01    3.2e-04
    300      5.54e+00      4.16e-04      4.14e-01    1.8e-04
    400      3.78e+00      2.64e-02      3.73e-01    1.0e-04
    500      2.77e+00      2.12e-03      3.66e-01    5.6e-05
    600      2.19e+00      8.75e-03      3.82e-01    3.2e-05
    700      1.66e+00      5.43e-03      3.83e-01    1.8e-05
    800      1.37e+00      8.95e-03      4.11e-01    1.0e-05
    900      1.19e+00      1.16e-02      4.54e-01    5.6e-06
   1000      8.69e-01      4.41e-03      4.24e-01    3.2e-06
   1100      7.20e-01      2.93e-03      4.31e-01    1.8e-06
   1200      5.34e-01      2.68e-03      3.76e-01    1.0e-06
   1300      4.79e-01      4.74e-03      3.76e-01    5.6e-07
   1400      5.11e-01      1.14e-02      4.01e-01    3.2e-07
   1500      4.73e-01      3.63e-03      4.31e-01    1.8e-07
   1600      4.71e-01      3.75e-03      4.44e-01    1.0e-07
   1700      3.86e-01      9.01e-03      3.34e-01    5.6e-08
   1800      3.83e-01      9.08e-03      3.34e-01    3.2e-08
   1900      3.74e-01      9.85e-04      3.70e-01    1.8e-08
   2000      3.86e-01      3.99e-03      3.76e-01    1.0e-08

The two results differ substantially in the force RMSE, so I suspect that a bug was introduced between v2.1.5 and v2.2.0. If not, I wonder why the resulting RMSE is so different between the new and old versions.
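For anyone who wants to check this on their own runs, here is a minimal sketch (not part of DeePMD-kit) that parses lcurve.out-style text and compares the last force RMSE to the first. The helper names parse_lcurve and force_rmse_ratio are made up for illustration, and the two excerpts are just the step 0, 1000, and 2000 rows copied from the tables above.

```python
# Hypothetical helpers for illustration; they are not DeePMD-kit APIs.

def parse_lcurve(text):
    """Parse lcurve.out-style text into a dict: column name -> list of floats."""
    columns = None
    data = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line, e.g. "#  step  rmse_trn  rmse_e_trn  rmse_f_trn  lr"
            columns = line.lstrip("#").split()
            data = {name: [] for name in columns}
            continue
        if columns is None:
            raise ValueError("no header line found before data rows")
        values = [float(token) for token in line.split()]
        for name, value in zip(columns, values):
            data[name].append(value)
    return data


def force_rmse_ratio(data, column="rmse_f_trn"):
    """Return last / first value of the force RMSE column.

    A ratio above 1 means the training force error grew over the run.
    """
    series = data[column]
    return series[-1] / series[0]


# Rows copied from the first table above (the rising case).
RISING_EXCERPT = """
#  step      rmse_trn    rmse_e_trn    rmse_f_trn         lr
      0      2.58e+01      1.53e-01      8.15e-01    1.0e-03
   1000      1.61e+01      4.17e-02      7.86e+00    3.2e-06
   2000      1.28e+01      9.54e-02      1.26e+01    1.0e-08
"""

# Rows copied from the second table above (v2.1.5 and v2.1.0).
REASONABLE_EXCERPT = """
#  step      rmse_trn    rmse_e_trn    rmse_f_trn         lr
      0      2.58e+01      1.53e-01      8.15e-01    1.0e-03
   1000      8.69e-01      4.41e-03      4.24e-01    3.2e-06
   2000      3.86e-01      3.99e-03      3.76e-01    1.0e-08
"""

rising_ratio = force_rmse_ratio(parse_lcurve(RISING_EXCERPT))
reasonable_ratio = force_rmse_ratio(parse_lcurve(REASONABLE_EXCERPT))
print(f"rising case:     final/initial rmse_f_trn = {rising_ratio:.2f}")
print(f"reasonable case: final/initial rmse_f_trn = {reasonable_ratio:.2f}")
```

To use it on a real run, pass the contents of the lcurve.out file written by dp train to parse_lcurve instead of the excerpts. The ratio is only a crude indicator, since rmse_f_trn is evaluated on training batches and fluctuates from step to step.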

DeePMD-kit Version

2.2.4, 2.2.0, 2.1.5, 2.1.0

TensorFlow Version

Default version in the offline packages

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Described as above.

Steps to Reproduce

Run the examples/water/dplr/train case as the documentation directs.

Further Information, Files, and Links

No response

Metadata

Labels

bug
critical: Critical bugs that may break the results without messages
reproduced: This bug has been reproduced by developers
