LLM4Mat-Bench

LLM4Mat-Bench is the largest benchmark to date for evaluating the performance of large language models (LLMs) for materials property prediction.

[Figure: LLM4Mat-Bench statistics. *https://www.snumat.com/apis]

How to use

Installation

git clone https://github.com/vertaix/LLM4Mat-Bench.git
cd LLM4Mat-Bench
conda create -n <environment_name> --file requirement.txt
conda activate <environment_name>

Get the data

  • Download the LLM4Mat-Bench data from this link. Each dataset includes a fixed train/validation/test split for reproducibility and fair model comparison.
  • Save the data into a data folder with LLM4Mat-Bench as the parent directory (see the loading sketch below).
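
Once saved, a split can be loaded with pandas. This is a minimal sketch that assumes the download unpacks to data/<dataset_name>/train.csv and test.csv; the actual file layout and column names may differ, so check the download first.

import pandas as pd

dataset_name = 'mp'        # any dataset name in LLM4Mat-Bench
property_name = 'band_gap' # any property name in the chosen dataset

# File names below are an assumption about the download layout.
train = pd.read_csv(f'data/{dataset_name}/train.csv')
test = pd.read_csv(f'data/{dataset_name}/test.csv')

print(train.columns.tolist())          # list the available properties
print(train[property_name].describe()) # sanity-check the target values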

Get the checkpoints

  • Download the LLM-Prop and MatBERT checkpoints from this link.
  • Save the checkpoints folder into the LLM4Mat-Bench directory.

Evaluating the trained LLM-Prop and MatBERT

Modify the following script in scripts/evaluate.sh as needed:

#!/usr/bin/env bash

DATA_PATH='data/' # where LLM4Mat_Bench data is saved
RESULTS_PATH='results/' # where to save the results
CHECKPOINTS_PATH='checkpoints/' # where model weights were saved
MODEL_NAME='llmprop' # or 'matbert'
DATASET_NAME='mp' # any dataset name in LLM4Mat_Bench
INPUT_TYPE='formula' # other values: 'cif_structure' and 'description'
PROPERTY_NAME='band_gap' # any property name in $DATASET_NAME. Please check the property names associated with each dataset first

python code/llmprop_and_matbert/evaluate.py \
--data_path $DATA_PATH \
--results_path $RESULTS_PATH \
--checkpoints_path $CHECKPOINTS_PATH \
--model_name $MODEL_NAME \
--dataset_name $DATASET_NAME \
--input_type $INPUT_TYPE \
--property_name $PROPERTY_NAME

Then run

bash scripts/evaluate.sh

Training LLM-Prop and MatBERT from scratch

Modify the following script in scripts/train.sh as needed:

#!/usr/bin/env bash

DATA_PATH='data/' # where LLM4Mat_Bench data is saved
RESULTS_PATH='results/' # where to save the results
CHECKPOINTS_PATH='checkpoints/' # where to save model weights 
MODEL_NAME='llmprop' # or 'matbert'
DATASET_NAME='mp' # any dataset name in LLM4Mat_Bench
INPUT_TYPE='formula' # other values: 'cif_structure' and 'description'
PROPERTY_NAME='band_gap' # any property name in $DATASET_NAME. Please check the property names associated with each dataset first
MAX_LEN=256 # for testing purposes only; the default value is 888, while 2000 has been shown to give the best performance
EPOCHS=5 # for testing purposes only; the default value is 200

python code/llmprop_and_matbert/train.py \
--data_path $DATA_PATH \
--results_path $RESULTS_PATH \
--checkpoints_path $CHECKPOINTS_PATH \
--model_name $MODEL_NAME \
--dataset_name $DATASET_NAME \
--input_type $INPUT_TYPE \
--property_name $PROPERTY_NAME \
--max_len $MAX_LEN \
--epochs $EPOCHS

Then run

bash scripts/train.sh
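
The script trains one property at a time. To sweep several properties back to back, one option is a small Python driver that reruns train.py per property; the property names below are placeholders, so check each dataset's actual property list first.

import subprocess

# Placeholder property names -- replace with real ones for the dataset.
for prop in ['band_gap', 'formation_energy_per_atom']:
    subprocess.run(
        ['python', 'code/llmprop_and_matbert/train.py',
         '--data_path', 'data/',
         '--results_path', 'results/',
         '--checkpoints_path', 'checkpoints/',
         '--model_name', 'llmprop',
         '--dataset_name', 'mp',
         '--input_type', 'formula',
         '--property_name', prop,
         '--max_len', '256',
         '--epochs', '5'],
        check=True,  # stop the sweep if one run fails
    )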

Generating the property values with LLaMA2-7b-chat model

Modify the following script in scripts/llama_inference.sh as needed:

#!/usr/bin/env bash

DATA_PATH='data/' # where LLM4Mat_Bench data is saved
RESULTS_PATH='results/' # where to save the results
DATASET_NAME='mp' # any dataset name in LLM4Mat_Bench
INPUT_TYPE='formula' # other values: 'cif_structure' and 'description'
PROPERTY_NAME='band_gap' # any property name in $DATASET_NAME. Please check the property names associated with each dataset first
PROMPT_TYPE='zero_shot' # 'few_shot' can also be used, which lets LLaMA see five examples before it generates the answer
MAX_LEN=800 # max_len and batch_size can be modified according to the available resources
BATCH_SIZE=8

python code/llama/llama_inference.py \
--data_path $DATA_PATH \
--results_path $RESULTS_PATH \
--dataset_name $DATASET_NAME \
--input_type $INPUT_TYPE \
--property_name $PROPERTY_NAME \
--prompt_type $PROMPT_TYPE \
--max_len $MAX_LEN \
--batch_size $BATCH_SIZE

Then run

bash scripts/llama_inference.sh
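
For intuition, a zero-shot prompt asks for the property value directly, while a few-shot prompt prepends five solved examples before the query. The strings below are illustrative only (hypothetical formulas and placeholder values); the actual templates live in code/llama/llama_inference.py.

# Illustrative prompt shapes only; see code/llama/llama_inference.py
# for the templates actually used by the benchmark.
zero_shot = ('What is the band_gap of the material with formula NaCl? '
             'Answer with a number.')

# A few-shot prompt prepends solved examples (values are placeholders).
examples = [('MgO', 4.45), ('Si', 0.61)]
few_shot = ''.join(f'Formula: {f}\nband_gap: {v}\n\n' for f, v in examples)
few_shot += 'Formula: NaCl\nband_gap:'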

Evaluating the LLaMA results

After running bash scripts/llama_inference.sh, modify the following script in scripts/llama_evaluate.sh as needed:

#!/usr/bin/env bash

DATA_PATH='data/' # where LLM4Mat_Bench data is saved
RESULTS_PATH='results/' # where to save the results
DATASET_NAME='mp' # any dataset name in LLM4Mat_Bench
INPUT_TYPE='formula' # other values: 'cif_structure' and 'description'
PROPERTY_NAME='band_gap' # any property name in $DATASET_NAME. Please check the property names associated with each dataset first
PROMPT_TYPE='zero_shot' # 'few_shot' can also be used, which lets LLaMA see five examples before it generates the answer
MAX_LEN=800 # max_len and batch_size can be modified according to the available resources
BATCH_SIZE=8
MIN_SAMPLES=2 # minimum number of valid outputs from LLaMA (the default is 10)

python code/llama/evaluate.py \
--data_path $DATA_PATH \
--results_path $RESULTS_PATH \
--dataset_name $DATASET_NAME \
--input_type $INPUT_TYPE \
--property_name $PROPERTY_NAME \
--prompt_type $PROMPT_TYPE \
--max_len $MAX_LEN \
--batch_size $BATCH_SIZE \
--min_samples $MIN_SAMPLES

Then run

bash scripts/llama_evaluate.sh
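
The MIN_SAMPLES guard exists because LLaMA generations are free text and not every generation parses into a usable number; when fewer than min_samples valid values survive, no score can be computed, which is what the Inval. cells in the tables below denote. A sketch of that filtering idea follows; the real parsing in code/llama/evaluate.py is likely more elaborate.

import re

def parse_number(generation):
    # Pull the first numeric token out of a free-text generation.
    match = re.search(r'-?\d+(?:\.\d+)?', generation)
    return float(match.group()) if match else None

def filter_valid(generations, min_samples=10):
    values = [v for v in map(parse_number, generations) if v is not None]
    # Too few valid outputs: the property is reported as Inval.
    return values if len(values) >= min_samples else None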

Data LICENSE

The data LICENSE belongs to the original creators of each dataset/database.

Leaderboard

Each cell reports the average score across that dataset's tasks, weighted by the number of samples per task: the MAD:MAE ratio for regression tasks (higher is better) and the AUC score for classification tasks. Inval. indicates that the model did not produce enough valid outputs to be scored.

| Input | Model | MP Regr. (8 tasks) | MP Class. (2 tasks) | JARVIS-DFT Regr. (20 tasks) | GNoME Regr. (6 tasks) | hMOF Regr. (7 tasks) | Cantor HEA Regr. (4 tasks) | JARVIS-QETB Regr. (4 tasks) | OQMD Regr. (2 tasks) | QMOF Regr. (4 tasks) | SNUMAT Regr. (4 tasks) | SNUMAT Class. (3 tasks) | OMDB Regr. (1 task) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 5.319 | 0.846 | 7.048 | 19.478 | 2.257 | 17.780 | 61.729 | 14.496 | 3.076 | 1.973 | 0.722 | 2.751 |
| Comp. | Llama 2-7b-chat:0S | 0.389 | 0.491 | Inval. | 0.164 | 0.174 | 0.034 | 0.188 | 0.105 | 0.303 | 0.940 | Inval. | 0.885 |
| | Llama 2-7b-chat:5S | 0.627 | 0.507 | 0.704 | 0.499 | 0.655 | 0.867 | 1.047 | 1.160 | 0.932 | 1.157 | 0.466 | 1.009 |
| | MatBERT-109M | 5.317 | 0.722 | 4.103 | 12.834 | 1.430 | 6.769 | 11.952 | 5.772 | 2.049 | 1.828 | 0.712 | 1.554 |
| | LLM-Prop-35M | 4.394 | 0.691 | 2.912 | 15.599 | 1.479 | 8.400 | 59.443 | 6.020 | 1.958 | 1.509 | 0.719 | 1.507 |
| CIF | Llama 2-7b-chat:0S | 0.392 | 0.501 | 0.216 | 6.746 | 0.214 | 0.022 | 0.278 | 0.028 | 0.119 | 0.682 | 0.489 | 0.159 |
| | Llama 2-7b-chat:5S | Inval. | 0.502 | Inval. | Inval. | Inval. | Inval. | 1.152 | 1.391 | Inval. | Inval. | 0.474 | 0.930 |
| | MatBERT-109M | 7.452 | 0.750 | 6.211 | 14.227 | 1.514 | 9.958 | 47.687 | 10.521 | 3.024 | 2.131 | 0.717 | 1.777 |
| | LLM-Prop-35M | 8.554 | 0.738 | 6.756 | 16.032 | 1.623 | 15.728 | 97.919 | 11.041 | 3.076 | 1.829 | 0.660 | 1.777 |
| Descr. | Llama 2-7b-chat:0S | 0.437 | 0.500 | 0.247 | 0.336 | 0.193 | 0.069 | 0.264 | 0.106 | 0.152 | 0.883 | Inval. | 0.155 |
| | Llama 2-7b-chat:5S | 0.635 | 0.502 | 0.703 | 0.470 | 0.653 | 0.820 | 0.980 | 1.230 | 0.946 | 1.040 | 0.568 | 1.001 |
| | MatBERT-109M | 7.651 | 0.735 | 6.083 | 15.558 | 1.558 | 9.976 | 46.586 | 11.027 | 3.055 | 2.152 | 0.730 | 1.847 |
| | LLM-Prop-35M | 9.116 | 0.742 | 7.204 | 16.224 | 1.706 | 15.926 | 93.001 | 9.995 | 3.016 | 1.950 | 0.735 | 1.656 |

Results for the MP dataset. Performance on regression tasks is evaluated with the MAD:MAE ratio (higher is better), while the classification tasks (Is Stable and Is Gap Direct) are evaluated with the AUC score. FEPA: Formation Energy Per Atom, EPA: Energy Per Atom.

| Input | Model | FEPA (145.2K) | Bandgap (145.3K) | EPA (145.2K) | Ehull (145.2K) | Efermi (145.2K) | Density (145.2K) | Density Atomic (145.2K) | Volume (145.2K) | Is Stable (145.2K) | Is Gap Direct (145.2K) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 8.151 | 3.255 | 7.224 | 3.874 | 3.689 | 8.773 | 5.888 | 1.703 | 0.882 | 0.810 |
| Comp. | Llama 2-7b-chat:0S | 0.008 | 0.623 | 0.009 | 0.001 | 0.003 | 0.967 | 0.754 | 0.747 | 0.500 | 0.482 |
| | Llama 2-7b-chat:5S | 0.33 | 1.217 | 0.239 | 0.132 | 0.706 | 0.899 | 0.724 | 0.771 | 0.502 | 0.512 |
| | MatBERT-109M | 8.151 | 2.971 | 9.32 | 2.583 | 3.527 | 7.626 | 5.26 | 3.099 | 0.764 | 0.681 |
| | LLM-Prop-35M | 7.482 | 2.345 | 7.437 | 2.006 | 3.159 | 6.682 | 3.523 | 2.521 | 0.746 | 0.636 |
| CIF | Llama 2-7b-chat:0S | 0.032 | 0.135 | 0.022 | 0.001 | 0.015 | 0.97 | 0.549 | 1.41 | 0.503 | 0.499 |
| | Llama 2-7b-chat:5S | Inval. | 1.111 | 0.289 | Inval. | 0.685 | 0.98 | 0.99 | 0.926 | 0.498 | 0.506 |
| | MatBERT-109M | 11.017 | 3.423 | 13.244 | 3.808 | 4.435 | 10.426 | 6.686 | 6.58 | 0.790 | 0.710 |
| | LLM-Prop-35M | 14.322 | 3.758 | 17.354 | 2.182 | 4.515 | 13.834 | 4.913 | 7.556 | 0.776 | 0.700 |
| Descr. | Llama 2-7b-chat:0S | 0.019 | 0.633 | 0.023 | 0.001 | 0.008 | 1.31 | 0.693 | 0.807 | 0.500 | 0.500 |
| | Llama 2-7b-chat:5S | 0.394 | 1.061 | 0.297 | 0.247 | 0.684 | 0.916 | 0.782 | 0.704 | 0.500 | 0.504 |
| | MatBERT-109M | 11.935 | 3.524 | 13.851 | 4.085 | 4.323 | 9.9 | 6.899 | 6.693 | 0.794 | 0.713 |
| | LLM-Prop-35M | 15.913 | 3.931 | 18.412 | 2.74 | 4.598 | 14.388 | 4.063 | 8.888 | 0.794 | 0.690 |

Results for JARVIS-DFT. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better). FEPA: Formation Energy Per Atom, Tot. En.: Total Energy, Exf. En.: Exfoliation Energy.

| Input | Model | FEPA (75.9K) | Bandgap (OPT) (75.9K) | Tot. En. (75.9K) | Ehull (75.9K) | Bandgap (MBJ) (19.8K) | Kv (23.8K) | Gv (23.8K) | SLME (9.7K) | Spillage (11.3K) | εx (OPT) (18.2K) | ε (DFPT) (4.7K) | Max. Piezo. (dij) (3.3K) | Max. Piezo. (eij) (4.7K) | Max. EFG (11.8K) | Exf. En. (0.8K) | Avg. me (17.6K) | n-Seebeck (23.2K) | n-PF (23.2K) | p-Seebeck (23.2K) | p-PF (23.2K) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 13.615 | 4.797 | 22.906 | 1.573 | 4.497 | 3.715 | 2.337 | 1.862 | 1.271 | 2.425 | 1.12 | 0.418 | 1.291 | 1.787 | 0.842 | 1.796 | 2.23 | 1.573 | 3.963 | 1.59 |
| Comp. | Llama 2-7b-chat:0S | 0.021 | 0.011 | 0.02 | 0.005 | 0.92 | 0.428 | 0.374 | 0.148 | Inval. | 0.18 | 0.012 | 0.121 | 0.001 | 0.141 | 0.384 | 0.028 | 0.874 | 0.801 | 0.971 | 0.874 |
| | Llama 2-7b-chat:5S | 0.886 | 0.011 | 0.02 | 1.292 | 0.979 | 0.88 | 0.992 | 0.456 | 0.85 | 1.148 | 1.416 | 1.289 | 1.305 | 0.765 | 0.512 | 0.535 | 1.008 | 1.04 | 0.93 | 0.568 |
| | MatBERT-109M | 6.808 | 4.083 | 9.21 | 2.786 | 3.755 | 2.906 | 1.928 | 1.801 | 1.243 | 2.017 | 1.533 | 1.464 | 1.426 | 1.658 | 1.124 | 2.093 | 1.908 | 1.318 | 2.752 | 1.356 |
| | LLM-Prop-35M | 4.765 | 2.621 | 5.936 | 2.073 | 2.922 | 2.162 | 1.654 | 1.575 | 1.14 | 1.734 | 1.454 | 1.447 | 1.573 | 1.38 | 1.042 | 1.658 | 1.725 | 1.145 | 2.233 | 1.285 |
| CIF | Llama 2-7b-chat:0S | 0.023 | 0.011 | 0.02 | 0.002 | 0.193 | 0.278 | 0.358 | 0.186 | 0.702 | 0.781 | 0.033 | 0.104 | 0.001 | 0.246 | 0.411 | 0.041 | 0.429 | 0.766 | 0.83 | 0.826 |
| | Llama 2-7b-chat:5S | 0.859 | Inval. | Inval. | 1.173 | 1.054 | 0.874 | 0.91 | 0.486 | 0.916 | 1.253 | Inval. | Inval. | Inval. | 0.796 | 0.51 | Inval. | 1.039 | 1.396 | Inval. | Inval. |
| | MatBERT-109M | 10.211 | 5.483 | 15.673 | 4.862 | 5.344 | 4.283 | 2.6 | 2.208 | 1.444 | 2.408 | 1.509 | 1.758 | 2.405 | 2.143 | 1.374 | 2.45 | 2.268 | 1.446 | 3.337 | 1.476 |
| | LLM-Prop-35M | 12.996 | 3.331 | 22.058 | 2.648 | 4.93 | 4.121 | 2.409 | 2.175 | 1.37 | 2.135 | 1.578 | 2.103 | 2.405 | 1.936 | 1.044 | 1.796 | 1.955 | 1.332 | 2.503 | 1.399 |
| Descr. | Llama 2-7b-chat:0S | 0.007 | 0.011 | 0.02 | 0.004 | 0.94 | 0.498 | 0.382 | 0.07 | 0.135 | 0.647 | 0.08 | 0.266 | 0.001 | 0.138 | 0.285 | 0.019 | 0.769 | 0.793 | 0.825 | 0.829 |
| | Llama 2-7b-chat:5S | 0.845 | 0.011 | 0.02 | 1.273 | 1.033 | 0.87 | 0.969 | 0.461 | 0.857 | 1.201 | 1.649 | 1.174 | 1.152 | 0.806 | 0.661 | 0.523 | 1.098 | 1.024 | 0.948 | 0.563 |
| | MatBERT-109M | 10.211 | 5.33 | 15.141 | 4.691 | 5.01 | 4.252 | 2.623 | 2.178 | 1.452 | 2.384 | 1.534 | 1.807 | 2.556 | 2.081 | 1.36 | 2.597 | 2.241 | 1.432 | 3.26 | 1.565 |
| | LLM-Prop-35M | 12.614 | 3.427 | 23.509 | 4.532 | 4.983 | 4.128 | 2.419 | 2.061 | 1.307 | 2.334 | 1.64 | 2.116 | 2.315 | 1.978 | 1.168 | 1.858 | 2.154 | 1.364 | 2.61 | 1.407 |

Results for SNUMAT. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better) while that of classification tasks (Is Direct, Is Direct HSE, and SOC) is evaluated in terms of AUC score.

| Input | Model | Bandgap GGA (10.3K) | Bandgap HSE (10.3K) | Bandgap GGA Optical (10.3K) | Bandgap HSE Optical (10.3K) | Is Direct (10.3K) | Is Direct HSE (10.3K) | SOC (10.3K) |
|---|---|---|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 2.075 | 2.257 | 1.727 | 1.835 | 0.691 | 0.675 | 0.800 |
| Comp. | Llama 2-7b-chat:0S | 0.797 | 0.948 | 1.156 | 0.859 | 0.503 | 0.484 | Inval. |
| | Llama 2-7b-chat:5S | 1.267 | 1.327 | 0.862 | 1.174 | 0.475 | 0.468 | 0.455 |
| | MatBERT-109M | 1.899 | 1.975 | 1.646 | 1.793 | 0.671 | 0.645 | 0.820 |
| | LLM-Prop-35M | 1.533 | 1.621 | 1.392 | 1.491 | 0.647 | 0.624 | 0.829 |
| CIF | Llama 2-7b-chat:0S | 0.346 | 0.454 | 1.09 | 0.838 | 0.479 | 0.488 | 0.500 |
| | Llama 2-7b-chat:5S | Inval. | Inval. | Inval. | Inval. | 0.494 | 0.500 | 0.427 |
| | MatBERT-109M | 2.28 | 2.472 | 1.885 | 1.889 | 0.677 | 0.650 | 0.823 |
| | LLM-Prop-35M | 1.23 | 2.401 | 1.786 | 1.9 | 0.661 | 0.664 | 0.656 |
| Descr. | Llama 2-7b-chat:0S | 0.802 | 0.941 | 1.013 | 0.779 | 0.499 | 0.509 | Inval. |
| | Llama 2-7b-chat:5S | 0.774 | 1.315 | 0.901 | 1.172 | 0.594 | 0.623 | 0.486 |
| | MatBERT-109M | 2.298 | 2.433 | 1.901 | 1.978 | 0.683 | 0.645 | 0.862 |
| | LLM-Prop-35M | 2.251 | 2.142 | 1.84 | 1.569 | 0.681 | 0.657 | 0.866 |

Results for GNoME. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better). FEPA: Formation Energy Per Atom, DEPA: Decomposition Energy Per Atom, Tot. En.: Total Energy.

| Input | Model | FEPA (376.2K) | Bandgap (282.7K) | DEPA (376.2K) | Tot. En. (282.7K) | Volume (282.7K) | Density (282.7K) |
|---|---|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 34.57 | 8.549 | 2.787 | 7.443 | 7.967 | 56.077 |
| Comp. | Llama 2-7b-chat:0S | 0.002 | 0.177 | 0.0 | 0.088 | 0.455 | 0.368 |
| | Llama 2-7b-chat:5S | 0.194 | 0.086 | 0.255 | 0.765 | 1.006 | 0.865 |
| | MatBERT-109M | 30.248 | 4.692 | 2.787 | 8.57 | 13.157 | 15.145 |
| | LLM-Prop-35M | 25.472 | 3.735 | 1.858 | 21.624 | 16.556 | 25.615 |
| CIF | Llama 2-7b-chat:0S | 0.003 | 0.045 | 0.0 | 0.706 | 43.331 | 0.794 |
| | Llama 2-7b-chat:5S | Inval. | 0.087 | Inval. | Inval. | 1.029 | 0.878 |
| | MatBERT-109M | 24.199 | 9.16 | 3.716 | 15.309 | 16.691 | 16.467 |
| | LLM-Prop-35M | 28.469 | 3.926 | 3.344 | 17.837 | 17.082 | 25.615 |
| Descr. | Llama 2-7b-chat:0S | 0.002 | 0.114 | 0.0 | 0.661 | 0.654 | 0.805 |
| | Llama 2-7b-chat:5S | 0.192 | 0.086 | 0.106 | 0.75 | 1.006 | 0.891 |
| | MatBERT-109M | 30.248 | 5.829 | 3.716 | 18.205 | 17.824 | 16.599 |
| | LLM-Prop-35M | 28.469 | 5.27 | 3.716 | 17.02 | 17.02 | 25.936 |

Results for hMOF. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better).

| Input | Model | Max CO2 (132.7K) | Min CO2 (132.7K) | LCD (132.7K) | PLD (132.7K) | Void Fraction (132.7K) | Surface Area (m²/g) (132.7K) | Surface Area (m²/cm³) (132.7K) |
|---|---|---|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 1.719 | 1.617 | 1.989 | 1.757 | 2.912 | 3.765 | 2.039 |
| Comp. | Llama 2-7b-chat:0S | 0.011 | 0.002 | 0.009 | 0.008 | 0.5 | 0.454 | 0.233 |
| | Llama 2-7b-chat:5S | 0.679 | 0.058 | 0.949 | 1.026 | 0.945 | 0.567 | 0.366 |
| | MatBERT-109M | 1.335 | 1.41 | 1.435 | 1.378 | 1.57 | 1.517 | 1.367 |
| | LLM-Prop-35M | 1.41 | 1.392 | 1.432 | 1.468 | 1.672 | 1.657 | 1.321 |
| CIF | Llama 2-7b-chat:0S | 0.017 | 0.003 | 0.016 | 0.011 | 0.549 | 0.54 | 0.359 |
| | Llama 2-7b-chat:5S | Inval. | Inval. | 0.951 | 1.067 | Inval. | Inval. | Inval. |
| | MatBERT-109M | 1.421 | 1.428 | 1.544 | 1.482 | 1.641 | 1.622 | 1.461 |
| | LLM-Prop-35M | 1.564 | 1.41 | 1.753 | 1.435 | 1.9 | 1.926 | 1.374 |
| Descr. | Llama 2-7b-chat:0S | 0.129 | 0.014 | 0.026 | 0.006 | 0.382 | 0.497 | 0.299 |
| | Llama 2-7b-chat:5S | 0.684 | 0.058 | 0.955 | 1.006 | 0.931 | 0.571 | 0.37 |
| | MatBERT-109M | 1.438 | 1.466 | 1.602 | 1.511 | 1.719 | 1.697 | 1.475 |
| | LLM-Prop-35M | 1.659 | 1.486 | 1.623 | 1.789 | 1.736 | 2.144 | 1.508 |

Results for Cantor HEA. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better). FEPA: Formation Energy Per Atom, EPA: Energy Per Atom, VPA: Volume Per Atom.

| Input | Model | FEPA (84.0K) | EPA (84.0K) | Ehull (84.0K) | VPA (84.0K) |
|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 9.036 | 49.521 | 9.697 | 2.869 |
| Comp. | Llama 2-7b-chat:0S | 0.005 | 0.098 | 0.003 | 0.031 |
| | Llama 2-7b-chat:5S | 0.896 | 0.658 | 0.928 | 0.986 |
| | MatBERT-109M | 3.286 | 16.17 | 5.134 | 2.489 |
| | LLM-Prop-35M | 3.286 | 22.638 | 5.134 | 2.543 |
| CIF | Llama 2-7b-chat:0S | 0.001 | 0.084 | 0.0 | 0.004 |
| | Llama 2-7b-chat:5S | Inval. | Inval. | Inval. | Inval. |
| | MatBERT-109M | 7.229 | 17.607 | 9.187 | 5.809 |
| | LLM-Prop-35M | 8.341 | 36.015 | 11.636 | 6.919 |
| Descr. | Llama 2-7b-chat:0S | 0.001 | 0.101 | 0.164 | 0.011 |
| | Llama 2-7b-chat:5S | 0.797 | 0.615 | 0.938 | 0.93 |
| | MatBERT-109M | 7.229 | 17.607 | 9.187 | 5.881 |
| | LLM-Prop-35M | 8.341 | 36.015 | 11.636 | 7.713 |

Results for QMOF. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better). Tot. En.: Total Energy.

| Input | Model | Bandgap (7.6K) | Tot. En. (7.6K) | LCD (7.6K) | PLD (7.6K) |
|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 2.431 | 1.489 | 4.068 | 4.317 |
| Comp. | Llama 2-7b-chat:0S | 0.901 | 0.26 | 0.045 | 0.009 |
| | Llama 2-7b-chat:5S | 0.648 | 0.754 | 1.241 | 1.086 |
| | MatBERT-109M | 1.823 | 1.695 | 2.329 | 2.349 |
| | LLM-Prop-35M | 1.759 | 1.621 | 2.293 | 2.157 |
| CIF | Llama 2-7b-chat:0S | 0.201 | 0.244 | 0.02 | 0.011 |
| | Llama 2-7b-chat:5S | Inval. | Inval. | Inval. | Inval. |
| | MatBERT-109M | 1.994 | 4.378 | 2.908 | 2.818 |
| | LLM-Prop-35M | 2.166 | 4.323 | 2.947 | 2.87 |
| Descr. | Llama 2-7b-chat:0S | 0.358 | 0.217 | 0.025 | 0.006 |
| | Llama 2-7b-chat:5S | 0.777 | 0.713 | 1.125 | 1.17 |
| | MatBERT-109M | 2.166 | 4.133 | 2.981 | 2.941 |
| | LLM-Prop-35M | 2.091 | 4.312 | 2.831 | 2.829 |

Results for JARVIS-QETB. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better). FEPA: Formation Energy Per Atom, EPA: Energy Per Atom, Tot. En.: Total Energy, Ind. Bandgap: Indirect Bandgap.

| Input | Model | FEPA (623.9K) | EPA (623.9K) | Tot. En. (623.9K) | Ind. Bandgap (623.9K) |
|---|---|---|---|---|---|
| CIF | CGCNN (baseline) | 1.964 | 228.201 | 11.218 | 5.534 |
| Comp. | Llama 2-7b-chat:0S | 0.003 | 0.369 | 0.172 | 0.21 |
| | Llama 2-7b-chat:5S | 0.812 | 1.037 | 1.032 | 1.306 |
| | MatBERT-109M | 1.431 | 37.979 | 8.19 | 0.21 |
| | LLM-Prop-35M | 2.846 | 211.757 | 21.309 | 1.861 |
| CIF | Llama 2-7b-chat:0S | 0.003 | 0.412 | 0.656 | 0.04 |
| | Llama 2-7b-chat:5S | 0.8 | 1.024 | 1.076 | 1.71 |
| | MatBERT-109M | 24.72 | 135.156 | 26.094 | 4.779 |
| | LLM-Prop-35M | 23.346 | 318.291 | 48.192 | 1.845 |
| Descr. | Llama 2-7b-chat:0S | 0.003 | 0.408 | 0.484 | 0.16 |
| | Llama 2-7b-chat:5S | 0.85 | 1.015 | 1.035 | 1.021 |
| | MatBERT-109M | 26.265 | 122.884 | 29.409 | 7.788 |
| | LLM-Prop-35M | 22.513 | 312.218 | 35.43 | 1.845 |

Results for OQMD. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better). FEPA: Formation Energy Per Atom.

| Input | Model | FEPA (963.5K) | Bandgap (963.5K) |
|---|---|---|---|
| CIF | CGCNN (baseline) | 22.291 | 6.701 |
| Comp. | Llama 2-7b-chat:0S | 0.019 | 0.192 |
| | Llama 2-7b-chat:5S | 1.013 | 1.306 |
| | MatBERT-109M | 7.662 | 3.883 |
| | LLM-Prop-35M | 9.195 | 2.845 |
| CIF | Llama 2-7b-chat:0S | 0.009 | 0.047 |
| | Llama 2-7b-chat:5S | 1.051 | 1.731 |
| | MatBERT-109M | 13.879 | 7.163 |
| | LLM-Prop-35M | 18.861 | 3.22 |
| Descr. | Llama 2-7b-chat:0S | 0.025 | 0.187 |
| | Llama 2-7b-chat:5S | 0.991 | 1.468 |
| | MatBERT-109M | 15.012 | 7.041 |
| | LLM-Prop-35M | 16.346 | 3.644 |

Results for OMDB. The performance on regression tasks is evaluated in terms of MAD:MAE ratio (the higher the better).

| Input | Model | Bandgap (12.1K) |
|---|---|---|
| CIF | CGCNN (baseline) | 2.751 |
| Comp. | Llama 2-7b-chat:0S | 0.886 |
| | Llama 2-7b-chat:5S | 1.009 |
| | MatBERT-109M | 1.554 |
| | LLM-Prop-35M | 1.507 |
| CIF | Llama 2-7b-chat:0S | 0.159 |
| | Llama 2-7b-chat:5S | 0.930 |
| | MatBERT-109M | 1.777 |
| | LLM-Prop-35M | 1.777 |
| Descr. | Llama 2-7b-chat:0S | 0.155 |
| | Llama 2-7b-chat:5S | 1.002 |
| | MatBERT-109M | 1.847 |
| | LLM-Prop-35M | 1.656 |
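
For reference, the MAD:MAE ratio used throughout is the mean absolute deviation of the target values divided by the model's mean absolute error, so a score near 1.0 is roughly as good as always predicting the mean, and higher is better. A minimal sketch, assuming MAD is computed over the test targets:

import numpy as np

def mad_mae_ratio(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mad = np.mean(np.abs(y_true - y_true.mean()))  # spread of the targets
    mae = np.mean(np.abs(y_true - y_pred))         # model error
    return mad / mae

# Always predicting the mean scores ~1.0; better models score higher.
print(mad_mae_ratio([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # -> 1.0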
