❓ Training energies with variable number of atoms across frames #137

peppe69 · 2022-01-19T13:54:35Z

peppe69
Jan 19, 2022

Training a model with xyz dataset

Hi,
we are trying to train a model using a dataset in extxyz format that contains cells of bcc iron.
Here is a sample cell:

54
Lattice="8.5008 0.0 0.0 0.0 8.5008 0.0 0.0 0.0 8.5008" Properties=species:S:1:pos:R:3:forces:R:3:Z:I:1 config_type=phonons_54_high config_name=bcc_bulk_54_0000 ecutwfc=1224.51225528 pbc="T T T" kpoints="4 4 4" degauss=0.136056917253 energy=-186887.234986
Fe     -0.00607641       0.00230075      -0.02907879      -0.73492052      -0.02561024       0.61799986       26 
Fe      1.32012806       1.31778495       1.44496509       0.44297176       1.18663711      -0.28337233       26 
Fe     -0.01845451      -0.12003023       2.89408200       0.78139089       0.95602536      -0.81690734       26 
...
(each row contains position and forces per atom; only the first 3 out of 54 atoms are shown)
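
For readers unfamiliar with the extxyz header: the Properties string in the sample describes the per-atom columns as name:type:width triples. Below is a minimal Python sketch that decodes it and computes the per-atom energy of this one frame. The helper name parse_properties is hypothetical (it is not part of ASE or NequIP), and the numbers are copied from the sample above.

```python
def parse_properties(spec):
    """Split an extxyz Properties string into (name, type, n_columns) triples."""
    fields = spec.split(":")
    if len(fields) % 3 != 0:
        raise ValueError("Properties must be name:type:width triples")
    return [
        (fields[i], fields[i + 1], int(fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]


# Values copied from the sample frame above.
layout = parse_properties("species:S:1:pos:R:3:forces:R:3:Z:I:1")
n_atoms = 54
total_energy = -186887.234986

# Symbol, 3 position components, 3 force components, atomic number.
columns_per_row = sum(width for _, _, width in layout)

# The total energy grows with the cell size; the per-atom value does not.
energy_per_atom = total_energy / n_atoms

print(layout)
print(columns_per_row)
print(round(energy_per_atom, 4))
```

The per-atom value is the quantity that stays comparable across cells of 1, 54, and 128 atoms.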

Since cells of different sizes are present (mainly cells with 1, 54, and 128 atoms), the total energies differ widely.
At the end of training we obtain a validation MAE on e/N of hundreds of meV, and the model performs badly when used to predict atomic properties.
The problem vanishes if we limit training to cells of the same size. Is it related to the different cell sizes? Is it necessary to standardize the energies beforehand?
Our configuration file, test.txt, is attached. It is a variant of minimal_eng.yaml, as we are training only on energies for now.
What is the correct setup of the configuration file in this case?
test.txt


Linux-cpp-lisp · 2022-01-19T16:51:47Z

Linux-cpp-lisp
Jan 19, 2022
Maintainer

Hi @peppe69,

Thanks for reaching out. One thing that leaps out immediately is that you aren't using rescaling:

model_builders:
  - EnergyModel
#  - PerSpeciesRescale
#  - RescaleEnergyEtc

Unless your data is already normalized (and even in that case), this is not going to work out well. We strongly recommend the default setup:

model_builders:
  - EnergyModel
  - PerSpeciesRescale
  - RescaleEnergyEtc

with default settings (please make sure that you are using the latest stable NequIP). The default settings will give you a model whose predicted energy is size-extensive, which will be very important for your variable-sized data.
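
To illustrate why a per-atom shift and scale matters for mixed cell sizes, here is a toy sketch in plain Python. It is not NequIP's implementation: the function name and the SHIFT and SCALE values are assumptions chosen for illustration (the shift is roughly the per-atom energy of the 54-atom sample frame). The total energy is written as a sum of shifted and scaled per-atom outputs, so it grows with the number of atoms while the network itself only has to produce small residuals.

```python
def total_energy(raw_atomic_outputs, shift, scale):
    """Sum per-atom contributions: E = sum_i (scale * e_i + shift)."""
    return sum(scale * e + shift for e in raw_atomic_outputs)


SHIFT = -3460.83  # illustrative per-atom offset, same units as the dataset
SCALE = 0.16      # illustrative per-atom scale, same units as the dataset

# With zero raw network output, the prediction is N * SHIFT, so cells with
# 1, 54, and 128 atoms get total energies on very different scales ...
totals = {n: total_energy([0.0] * n, SHIFT, SCALE) for n in (1, 54, 128)}

# ... while the per-atom energy is the same for all of them.
per_atom = {n: e / n for n, e in totals.items()}

print(totals)
print(per_atom)
```

Without such a shift, an unnormalized network would have to output values of order -3460 per atom directly, which is one plausible reason the energy-only run on mixed cell sizes struggled.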

Second (while not necessary), we have found that training jointly on forces and energies is very helpful even if you don't need the forces from your model. Since you seem to have force data in your training set, you might consider this.

I'm also cc'ing my colleague @simonbatzner, who is most familiar with hyperparameters for these models.

2 replies

mariummou Jul 17, 2024

Hello,
Sorry for commenting here. My dataset contains frames with different numbers of atoms, and I am getting this error:
ValueError: Found a None in the provided data objects for batching in key stress

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/usrapps/dwb/mmou/nequip/bin/nequip-train", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/usrapps/dwb/mmou/nequip/lib/python3.12/site-packages/nequip/scripts/train.py", line 83, in main
    trainer = fresh_start(config)
              ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/usrapps/dwb/mmou/nequip/lib/python3.12/site-packages/nequip/scripts/train.py", line 196, in fresh_start
    dataset = dataset_from_config(config, prefix="dataset")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/usrapps/dwb/mmou/nequip/lib/python3.12/site-packages/nequip/data/build.py", line 78, in dataset_from_config
    instance, _ = instantiate(
                  ^^^^^^^^^^^^
  File "/usr/local/usrapps/dwb/mmou/nequip/lib/python3.12/site-packages/nequip/utils/auto_init.py", line 245, in instantiate
    raise RuntimeError(
RuntimeError: Failed to build object with prefix dataset using builder ASEDataset
Do I need to have a fixed number of atoms, or am I missing something? Any suggestion would be helpful, thanks.
16
Lattice="5.885510724 1.04413177 -0.0 -5.655489276 7.707331227 -0.0 3.616978552 -2.08826354 6.687" Properties=species:S:1:pos:R:3:forces:R:3 stress="-0.40495150258530094 -0.0849967963781687 0.25249470922750344 -0.0849967963781687 -0.07775347510550817 -0.04172617995550269 0.25249470922750344 -0.04172617995550269 -0.6266741737241558" free_energy=-31.96450481 energy=-31.96450481 pbc="T T T"
Ba 0.17788000 2.18233000 1.78950000 -16.02805800 -6.67820100 -2.24668400
Ba 3.96259000 -0.11121000 4.62618000 -28.80804900 -6.75819600 103.44914800
Ba 2.42760000 2.92590000 1.60337000 22.74640400 1.90358100 -12.37054700
Ba 6.99133000 0.41177000 5.36172000 0.26549100 0.19799200 4.10083600
Ba -2.76777000 6.35008000 2.04950000 16.59548700 -3.29609200 -21.52321000
Ba 0.69294000 4.24383000 4.59197000 1.00105800 10.97409700 8.55905900
Ba 0.27908000 6.64526000 2.05330000 -1.44788000 -2.98929100 -6.13623100
Ba 4.96556000 3.58873000 5.49907000 -9.06335500 -5.92535200 3.39656300
O -1.78539000 5.63328000 6.33713000 -1.38418800 -1.43969000 0.23882500
O 1.92996000 7.50440000 3.24840000 -15.10262300 1.74603900 16.21832800
O 2.44378000 0.78064000 0.36853000 0.83658500 -0.23813500 0.38009400
O 4.33562000 -0.00148000 3.34275000 29.34903200 11.20015800 -98.45349900
O 0.99341000 1.77721000 6.50654000 -1.41883800 -0.99794000 0.18077800
O 4.19455000 3.77646000 3.12181000 1.28321800 0.38678900 0.67891900
O 3.04308000 2.30169000 6.49414000 2.73527400 0.23960000 0.96836800
O 1.56992000 3.58274000 3.09565000 -1.55955900 1.67463900 2.55925300
4
Lattice="4.76411 0.0 0.988552825 0.0 4.76411 0.0 0.819614625 0.0 3.94995" Properties=species:S:1:pos:R:3:forces:R:3 stress="-0.12453601395105318 -0.0 0.05932598114715919 -0.0 -0.052608794127774104 -0.0 0.05932598114715919 -0.0 -0.022565889135211126" free_energy=-21.92976455 energy=-21.92976455 pbc="T T T"
Ba 0.22257000 2.38205000 1.07265000 -2.67711200 0.00000000 0.81685100
Ba 2.97910000 0.00000000 3.37158000 2.67711200 0.00000000 -0.81685100
O 0.00000000 0.00000000 0.00000000 -1.86397300 0.00000000 0.30610300
O 2.38205000 2.38205000 0.49428000 1.86397300 -0.00000000 -0.30610300
2
Lattice="3.28 0.0 0.0 0.0 3.28 0.0 0.0 0.0 3.28" Properties=species:S:1:pos:R:3:forces:R:3 stress="-0.0504758769726769 -0.00039203542970585325 4.173073001565547e-05 -0.00039203542970585325 -0.05065688697883664 0.00012896206155899988 4.173073001565547e-05 0.00012896206155899988 -0.0504665958486067" free_energy=-11.04616834 energy=-11.04616834 pbc="T T T"
Ba 0.05105000 3.25515000 0.04447000 0.01003200 0.03107700 -0.00326700
O 1.71446000 1.68851000 1.67685000 -0.01003200 -0.03107700 0.00326700
config.txt

Linux-cpp-lisp Jul 18, 2024
Maintainer

Can you open a new discussion for this new question?

simonbatzner · 2022-01-19T20:03:23Z

simonbatzner
Jan 19, 2022
Maintainer

Hi @peppe69, following up on Alby's reply. A few things:

As he pointed out, the absolutely crucial thing is to use the normalization he suggested, which will give you correct size-extensive scaling.
Also as he pointed out, forces help massively. We are talking factors of 5-10x here, and this is a pretty universal effect (it is not unique to NequIP, see e.g. here). It holds not only for the forces themselves, but also for energies when they are trained together with forces.

You seem to have changed a number of the other hyperparameters as well in ways that I think are suboptimal. I will list them here:

the irreps are set in a way that will not give good results:

irreps_edge_sh: 0e + 1o
conv_to_output_hidden_irreps_out: 16x0o + 16x0e + 16x1o + 16x1e + 16x2o + 16x2e
feature_irreps_hidden: 16x0o + 16x0e

Note that feature_irreps_hidden describes the features used internally in the network. If you set these to l=0, you get back a scalar network, which will not have the nice properties of NequIP. The minimum you should set here is l=1. irreps_edge_sh should go along with feature_irreps_hidden: if you set the hidden features to l=1, you should also set the edge irreps to l=1. Finally, conv_to_output_hidden_irreps_out should always be just even scalars; tensor features will have no effect here. So here's the l=1 choice:

feature_irreps_hidden: 16x0o + 16x0e + 16x1o + 16x1e
irreps_edge_sh: 0e + 1o
conv_to_output_hidden_irreps_out: 16x0e

This gives you up to l=1 features with parity. If you want to exclude parity, you can use:

feature_irreps_hidden:  16x0e + 16x1e
irreps_edge_sh: 0e + 1e
conv_to_output_hidden_irreps_out: 16x0e

If you want a more accurate l=2 network (which is also more expensive), then set:

feature_irreps_hidden: 16x0o + 16x0e + 16x1o + 16x1e + 16x2o + 16x2e
irreps_edge_sh: 0e + 1o + 2e
conv_to_output_hidden_irreps_out: 16x0e
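
The pattern in these three settings can be generated for any l_max. The helper below is a hypothetical convenience in plain Python (it is not part of NequIP or e3nn). It only builds the strings shown above, assuming 16 features per irrep and, when parity is used, spherical-harmonic parity (-1)^l on the edges.

```python
def irreps_for_lmax(l_max, num_features=16, parity=True):
    """Build the three irreps strings for a given l_max."""

    def sh_parity(l):
        # With parity, spherical harmonics of degree l have parity (-1)^l;
        # without parity, the examples above use even irreps only.
        return "o" if (parity and l % 2 == 1) else "e"

    hidden = []
    for l in range(l_max + 1):
        if parity:
            hidden.append(f"{num_features}x{l}o")
        hidden.append(f"{num_features}x{l}e")

    return {
        "feature_irreps_hidden": " + ".join(hidden),
        "irreps_edge_sh": " + ".join(
            f"{l}{sh_parity(l)}" for l in range(l_max + 1)
        ),
        "conv_to_output_hidden_irreps_out": f"{num_features}x0e",
    }


# Print the settings in the same "key: value" form used in the config file.
for key, value in irreps_for_lmax(2).items():
    print(f"{key}: {value}")
```

Calling irreps_for_lmax(1, parity=False) reproduces the parity-free l=1 choice listed earlier.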

r_max is important to tune. 3.5 seems low; usually 4-6 works best. It is certainly possible that 3.5 is good for your system, but be sure to check.
batch_size: we found small batch sizes to be important for joint energy-force training (the forces add 3N labels per structure, so you have 3N+1 labels instead of just 1 in the case of energy-only training). For energy-only training, larger batch sizes may work well (and will give you faster training). You might also need different learning rates, those are both important to scan. But again, if you have forces, you should always use them.
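
The label-count argument can be made concrete for the cell sizes mentioned in this thread (1, 54, and 128 atoms). This is a small Python sketch; the function name and the example batch are made up for illustration.

```python
def labels_per_structure(n_atoms, with_forces):
    """One energy label, plus 3 force components per atom if forces are used."""
    return 1 + (3 * n_atoms if with_forces else 0)


batch = [1, 54, 128]  # atoms per structure in a hypothetical batch

energy_only = sum(labels_per_structure(n, with_forces=False) for n in batch)
energy_and_forces = sum(labels_per_structure(n, with_forces=True) for n in batch)

print(energy_only)        # 3
print(energy_and_forces)  # 552
```

The same three structures carry far more training signal once forces are included, which is why the batch size that works well can differ between the two setups.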

I've attached a suggested energy config below for you to try (I kept your small n_train but obviously more is better!). Let us know how this works!

# general
root: results/test-iron
run_name: test-minimal_eng-suggested
seed: 12345678
append: True

# network
model_builders:
  - EnergyModel
  - PerSpeciesRescale
  - RescaleEnergyEtc

default_dtype: float64                                                           

r_max: 5.0                                                                         
num_layers: 4 
                                                                     
chemical_embedding_irreps_out: 16x0e                                                                                                                 
irreps_edge_sh: 0e + 1o + 2e
conv_to_output_hidden_irreps_out: 16x0e
feature_irreps_hidden: 16x0o + 16x0e + 16x1o + 16x1e + 16x2o + 16x2e

nonlinearity_type: gate                                                            
resnet: false   
use_sc: true   


# radial network basis
num_basis: 8                                                                       
BesselBasis_trainable: true                                                        
PolynomialCutoff_p: 6                                                             

# radial network
invariant_layers: 2                                                                
invariant_neurons: 64                                                              
avg_num_neighbors: null                                                           
                                                                   

# data set
dataset: ase                                                                       
dataset_file_name: /media/HDD/alocaputo/DB0.xyz
key_mapping:

chemical_symbols:
  - Fe

# logging
wandb: false
wandb_project: nequip-energy

verbose: info                                                                       
log_batch_freq: 1                                                                   
log_epoch_freq: 1                                                                  
save_checkpoint_freq: -1                                                           
save_ema_checkpoint_freq: -1                                                        

# training
#n_train: 4800                                                                      
n_train: 12                                                                       
#n_val: 1201                                                                          
n_val: 5                                                                          
learning_rate: 0.005        
batch_size: 5                              
max_epochs: 100000                                                           
train_val_split: random           
shuffle: true                                                                       
metrics_key: validation_loss                                                       
use_ema: true       
ema_decay: 0.99                                                                     
ema_use_num_updates: true                                                           
report_init_validation: false                                                    

early_stopping_patiences:                                                          # stop early if a metric value stopped decreasing for n epochs
  validation_loss: 50

early_stopping_cumulative_delta: false                                             # If True, the minimum value recorded will not be updated when the decrease is smaller than delta

early_stopping_lower_bounds:                                                       # stop early if a metric value is lower than the bound
  LR: 1.0e-6

loss_coeffs:                                                                       # different weights to use in a weighted loss functions
  total_energy:                                                                    # alternatively, if energies are not of importance, a force weight 1 and an energy weight of 0 also works.
    - 1
    - PerAtomMSELoss                                                              

# output metrics                                                                    # (*gl*: added this section)
metrics_components:                                                                                                             
  - - total_energy
    - mae
    - PerAtom: True                                                                # if true, energy is normalized by the number of atoms


optimizer_name: Adam

1 reply

peppe69 Jan 20, 2022
Author

Thank you very much for the detailed answer. There are a lot of valuable hints, and we'll launch a new training as soon as possible.
And, sure, we'll inform you about the results!

peppe69 · 2022-02-16T10:16:08Z

peppe69
Feb 16, 2022
Author

Hi,
we finally succeeded in training a nequip model with our data, using both energies and forces.
But we may have found a bug in the nequip code, so please check carefully what follows.
In detail, the per-atom energy statistics for the whole dataset are: mean=-3460.8266742392325; std=0.16236037667479927. The same statistics as evaluated by nequip are: dataset_per_atom_total_energy_mean=-22883.892153712808, dataset_per_atom_total_energy_std=62120.49343479028
So we debugged the code and found this: in nequip/data/dataset.py, line 540, the per-atom energies are evaluated as arr / N.
Since the shapes of the tensors differ (arr: [n_samples, 1]; N: [n_samples]), element-wise division is NOT performed. Instead, the result is a tensor with shape [n_samples, n_samples]; as a consequence, torch.mean(arr) gives the wrong value.
We changed the code by reshaping N = N.reshape((-1,1)) before arr / N, and the expected, true values are obtained.
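
The shape mismatch can be reproduced outside NequIP. The sketch below uses NumPy, whose broadcasting rules torch follows for this case. The energies are made-up values (a constant per-atom energy of -3460.0 for cells of 1, 54, and 128 atoms), and the names arr and N simply mirror the ones quoted from dataset.py.

```python
import numpy as np

N = np.array([1.0, 54.0, 128.0])    # atoms per frame, shape (3,)
arr = (-3460.0 * N).reshape(-1, 1)  # total energies, shape (3, 1)

# Buggy form: (3, 1) / (3,) broadcasts to (3, 3), so every energy is
# divided by every atom count instead of only its own.
wrong = arr / N

# Fixed form: reshape N to (3, 1) so the division is element-wise.
right = arr / N.reshape(-1, 1)

print(wrong.shape, wrong.mean(), wrong.std())
print(right.shape, right.mean(), right.std())
```

In the fixed form every entry equals the per-atom energy, so the mean is the true value and the standard deviation is zero for this toy data. In the buggy form the off-diagonal entries mix energies and atom counts from different frames, which inflates both statistics.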
Then we trained nequip with a variant of example.yaml with l_max=2, and obtained a validation MAE on e/N of about 1.4 meV after 80 epochs. Moreover, the predictions of atomic properties such as the equation of state, Bain path, and vacancy formation energy are all good.
That’s all, thanks in advance for your comments!

2 replies

Linux-cpp-lisp Feb 16, 2022
Maintainer

Hi @peppe69 ,

Based on my initial investigation, this does appear to be an issue, yes. Thank you for checking on this and letting us know! I will reply again when I have more info.

nw13slx Feb 22, 2022
Collaborator

@Linux-cpp-lisp has fixed this issue in pull request #157. Just a note for everyone.
