
[BUG] [critical] TF v2.13.0 calculates wrong GPU results #2660

@njzjz


Bug summary

Using the same DeePMD-kit code, TF v2.12.0 works fine, but TF v2.13.0 gives wrong GPU results for forces.

DeePMD-kit Version

v2.2.3.dev55+g37fd8d19

TensorFlow Version

2.13.0

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Test on examples/water/se_e2_a and compare lcurve.out.

TF v2.12.0 + GPU:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.60e+01    2.61e+01      6.76e-01    6.76e-01      8.20e-01    8.23e-01    1.0e-03
    100      1.18e+01    1.11e+01      1.90e-01    1.81e-01      3.73e-01    3.50e-01    1.0e-03
    200      7.50e+00    7.34e+00      5.96e-02    5.33e-02      2.37e-01    2.32e-01    1.0e-03

TF v2.13.0 + GPU:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0           nan    1.08e+03      5.27e+01    6.76e-01      1.28e+06    3.41e+01    1.0e-03
    100      3.20e+02    2.50e+02      5.24e-01    5.15e-01      1.01e+01    7.92e+00    1.0e-03
    200      4.55e+03    5.28e+02      3.60e+01    2.73e-01      1.44e+02    1.67e+01    1.0e-03

TF v2.13.0 + CPU:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      2.60e+01    2.61e+01      6.76e-01    6.76e-01      8.20e-01    8.23e-01    1.0e-03
    100      1.18e+01    1.11e+01      1.90e-01    1.81e-01      3.73e-01    3.50e-01    1.0e-03
    200      7.50e+00    7.34e+00      5.96e-02    5.33e-02      2.37e-01    2.32e-01    1.0e-03

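TF v2.12.0 + GPU and TF v2.13.0 + CPU give identical results. With TF v2.13.0 + GPU, rmse_e_trn at step 0 still matches (6.76e-01), but rmse_f_trn is 3.41e+01 instead of 8.23e-01 and rmse_val is nan, so the force results on GPU are wrong. The cause needs to be investigated.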
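The comparison above was done by eye. A minimal sketch of automating it, using only the Python standard library, might look like the following. The function names `parse_lcurve` and `find_mismatches` and the 1% relative tolerance are assumptions made for illustration, not part of DeePMD-kit; the sample rows are the step-0 lines quoted in this report.

```python
import math

# Step-0 rows copied from the lcurve.out excerpts in this report.
HEADER = "#  step  rmse_val  rmse_trn  rmse_e_val  rmse_e_trn  rmse_f_val  rmse_f_trn  lr\n"
REF = HEADER + "0  2.60e+01  2.61e+01  6.76e-01  6.76e-01  8.20e-01  8.23e-01  1.0e-03\n"
GPU = HEADER + "0  nan  1.08e+03  5.27e+01  6.76e-01  1.28e+06  3.41e+01  1.0e-03\n"


def parse_lcurve(text):
    """Parse lcurve.out-style text into (column_names, rows of floats)."""
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The header line lists the column names after the '#'.
            names = line.lstrip("#").split()
            continue
        rows.append([float(tok) for tok in line.split()])
    return names, rows


def find_mismatches(ref_text, test_text, rtol=1e-2):
    """Return (step, column, ref_value, test_value) for every differing entry.

    rtol is a hypothetical tolerance; a nan in the tested run is always flagged.
    """
    names, ref_rows = parse_lcurve(ref_text)
    _, test_rows = parse_lcurve(test_text)
    mismatches = []
    for ref, test in zip(ref_rows, test_rows):
        step = int(ref[0])
        # Skip the first column (step) and compare the remaining columns.
        for name, a, b in zip(names[1:], ref[1:], test[1:]):
            if math.isnan(b) or not math.isclose(a, b, rel_tol=rtol):
                mismatches.append((step, name, a, b))
    return mismatches


for step, name, a, b in find_mismatches(REF, GPU):
    print(f"step {step}: {name} reference={a:.2e} tested={b:.2e}")
```

To compare real runs, the two strings can be replaced by the contents of the lcurve.out files written by each run.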

Steps to Reproduce

Install:

pip install tensorflow==2.13.0
pip install -v .

Run examples:

cd examples/water/se_e2_a
dp train input.json
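
The report does not say how the CPU run was produced. One common way to get a CPU-only run on the same machine is to hide the GPUs from CUDA by setting CUDA_VISIBLE_DEVICES to an empty string before launching the training. The sketch below assumes that approach; the helper names `cpu_only_env` and `run_training` are hypothetical, and only the `dp train input.json` command comes from the steps above.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value makes no GPU visible to CUDA, so TensorFlow falls back to CPU.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_training(workdir, env=None):
    """Run `dp train input.json` in workdir with the given environment."""
    subprocess.run(["dp", "train", "input.json"], cwd=workdir, env=env, check=True)


# Usage (not executed here; assumes DeePMD-kit is installed and on PATH):
#   run_training("examples/water/se_e2_a")                      # GPU run
#   run_training("examples/water/se_e2_a", env=cpu_only_env())  # CPU run
```

Each run overwrites lcurve.out in the working directory, so the file should be copied aside between runs before comparing.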

Further Information, Files, and Links

No response


Labels: bug, critical (Critical bugs that may break the results without message), upstream
