hfcs-fffit is a repository used for a case study of our machine learning directed force field optimization workflow outlined in this preprint. Here we use the procedure to generate force fields for two hydrofluorocarbons (HFCs), HFC-32 and HFC-125.
This work has been submitted for review. In the meantime, you may cite the preprint as:
BJ Befort, RS DeFever, GM Tow, AW Dowling, and EJ Maginn. Machine learning
directed optimization of classical molecular modeling force fields. arXiv
(2021), https://arxiv.org/abs/2103.03208
The non-dominated and "top four" parameter sets for each HFC are
provided under hfcs-fffit/analysis/csv/
. The non-dominated
sets are found in r32-pareto.csv
and r125-pareto.csv
, and
the "top four" sets are found in r32-final-4.csv
and
r125-final-4.csv
. The parameter values in the csv files are
normalized between 0 and 1 based upon the parameter bounds for each
atom type (see manuscript, or hfcs-fffit/analysis/utils/r32.py
and hfcs-fffit/analysis/utils/r125.py
for definitions of
the upper and lower parameter bounds).
All molecular simulations were performed under hfcs-fffit/runs
.
Each iteration was managed with signac-flow
. Inside of each
directory in runs
, you will find all the necessary files to
run the simulations. Note that you may not get the exact same simulation
results due to differences in software versions, random seeds, etc.
Nonetheless, all of the results from our molecular simulations are saved
under hfcs-fffit/analysis/csv/rXX-YY-iterZZ-results.csv
, where XX
is the molecule, YY
is the stage (liquid density or VLE), and
ZZ
is the iteration number.
All of the scripts for the surrogate modeling are provided in
hfcs-fffit/analysis
, following the same naming structure as
the csv files.
All scripts required to generate the primary figures in the
manuscript are reported under hfcs-fffit/analysis/final-figs
and the
associated PDF files are located under
hfcs-fffit/analysis/final-figs/pdfs
.
This package has a number of requirements that can be installed in different ways. We recommend using a conda environment to manage most of the installation and dependencies. However, some items will need to be installed from source or pip.
We recommend starting with a fresh conda environment, then installing
the packages listed under requirements-pip.txt
with pip, then
installing the packages listed under requirements-conda.txt
with
conda, and finally installing a few items from source
requirements-other.txt
. We recommend python3.7
and
taking packages from conda-forge
.
Running the simulations will also require an installation of GROMACS. This can be installed separately (see installation instructions here).
WARNING: Cloning the hfcs-fffit
repository will take some time
and ~1.5 GB of disk space since it contains the Latin hypercube
that have ~1e6 parameter sets each.
An example of the procedure is provided below:
# First clone hfcs-fffit and install pip/conda available dependencies
# with a new conda environment named hfcs-fffit
git clone git@github.com:dowlinglab/hfcs-fffit.git
cd hfcs-fffit/
conda create --name hfcs-fffit python=3.7 -c conda-forge
conda activate hfcs-fffit
python3 -m pip install -r requirements-pip.txt
conda install --file requirements-conda.txt -c conda-forge
cd ../
# Now clone and install other dependencies
git clone git@github.com:dowlinglab/fffit.git
# Checkout the v0.1 release of fffit and install
cd fffit/
git checkout tags/v0.1
pip install .
cd ../
# Checkout the v0.1 release of block average and install
git clone git@github.com:rsdefever/block_average.git
cd block_average/
git checkout tags/v0.1
pip install .
cd ../
NOTE: We use signac and signac flow (https://signac.io/) to manage the setup and execution of the molecular simulations. These instructions assume a working knowledge of that software.
The first iteration of the liquid density simulations were
performed under the hfcs-fffit/runs/r32-density-iter1/
.
A Latin hypercube with 200 parameter sets exists under
hfcs-fffit/runs/r32-density-iter1/data/lh_samples_200_r32.txt
.
The signac workspace is created by hfcs-fffit/runs/r32-density-iter1/init.py
.
cd hfcs-fffit/runs/r32-density-iter1/
python init.py
The thermodynamic conditions for the simulations and the bounds for each parameter
(LJ sigma and epsilon for C, F, and H) are defined inside init.py
.
The simulation workflow is
defined in hfcs-fffit/runs/r32-density-iter1/project.py
. The flow operations
defined therein create the simulation input files, perform the simulations,
and run the analysis (calculating the average density). In order to run
these flow operations on a cluster with a job scheduler, it will be
necessary to edit the files under
hfcs-fffit/runs/r32-density-iter1/templates/
to be compatible with
your cluster. The signac documentation contains the necessary details.
Once the first iteration of simulations have completed (i.e., all the flow
operations are done), you can perform analysis. The necessary files are located
under hfcs-fffit/runs/analysis
and hfcs-fffit/runs/analysis/r32-density-iter1
.
The first step is to extract the results from your signac project into a CSV file
so they can be stored and accessed more easily in the future. This step is
performed by extract_r32_density.py
. The script requires the iteration number
as a command line argument.
WARNING: Running this script will overwrite your local copy of our simulation results (stored as CSV files) with the results from your simulations.
To extract the results for iteration 1 run the following:
cd hfcs-fffit/analysis/
python extract_r32_density.py 1
The CSV file with the results is saved under
hfcs-fffit/analysis/csv/rXX-YY-iterZZ-results.csv
where XX
is the molecule, YY
is the stage (liquid density or VLE), and
ZZ
is the iteration number.
The analysis is performed within a separate directory for each iteration.
For example, for the first iteration, it is performed under
hfcs-fffit/analysis/r32-density-iter1
. The script id-new-samples.py
loads the results from the CSV file, fits the SVM classifier and GP surrogate
models, loads the Latin hypercube with 1e6 prospective parameter sets,
and identifies the 200 new parameter sets to use for molecular simulations in
iteration 2. These parameter sets are saved to a CSV file:
hfcs-fffit/analysis/csv/r32-density-iter2-params.csv
.
The second iteration of the liquid density simulations were
performed under the hfcs-fffit/runs/r32-density-iter2/
. The procedure
is the same as for iteration 1, but this time the force field parameters
are taken from: hfcs-fffit/analysis/csv/r32-density-iter2-params.csv
.
The procedure for analysis is likewise analogous to iteration 1, however,
note that in training the surrogate models,
hfcs-fffit/runs/analysis/r32-density-iter2/id-new-samples.py
now uses
the simulation results from both iterations 1 and 2.
The optimization for the vapor-liquid equilibrium iterations is very similar.
The final (iteration 4) liquid density analysis saves the parameters as
hfcs-fffit/analysis/csv/r32-vle-iter1-params.csv
. The first VLE iteration
begins with these parameters. Once again, the simulations are performed under:
hfcs-fffit/runs/r32-vle-iter1
. The init.py
is used to set up the
signac workspace, and project.py
defines the simulation workflow
(create the inputs, perform the molecular simulations, and run the analysis).
The results are extracted into the CSV file with
hfcs-fffit/analysis/extract_r32_vle.py
. Once again the iteration number
is a command line argument, and the results are saved to a CSV file
hfcs-fffit/analysis/csv/r32-vle-iter1-results.csv
.
Each VLE iteration has a folder with the analysis
scripts (e.g., hfcs-fffit/analysis/r32-vle-iter1
). The analysis.py
and evaluate-gps.py
perform basic analysis and create figures evaluating
the performance of the GP models. The id-new-samples.py
loads a Latin
hypercube with 1e6 prospective parameter sets, and identifies the top-performing
parameter sets which will be evaluated with molecular simulations during the
subsequent iteration. For example, the parameter sets to be used for the second
VLE iteration are saved to hfcs-fffit/analysis/csv/r32-vle-iter2-params.csv
.
Each subsequent VLE iteration is performed in the same manner.
This work was supported by the National Science Foundation under grant NSF Award Number OAC-1835630 and NSF Award Number CBET-1917474. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.