35 changes: 17 additions & 18 deletions paper/bibliography.bib
@@ -202,25 +202,24 @@ @article{tingleZINC22AFreeMultiBillionScale2023
urldate = {2025-05-24}
}
@inproceedings{hagbergExploringNetworkStructure2008,
author = {Aric A. Hagberg and Daniel A. Schult and Pieter J. Swart},
title = {Exploring Network Structure, Dynamics, and Function using NetworkX},
booktitle = {Proceedings of the 7th Python in Science Conference},
pages = {11 - 15},
address = {Pasadena, CA USA},
year = {2008},
editor = {Ga\"el Varoquaux and Travis Vaught and Jarrod Millman}
author = {Aric A. Hagberg and Daniel A. Schult and Pieter J. Swart},
year = {2008},
month = {06},
title = {Exploring Network Structure, Dynamics, and Function Using NetworkX},
journal = {Proceedings of the 7th Python in Science Conference},
doi = {10.25080/TCWV9851}
}
@article{oboyleOpenBabelOpen2011,
title = {Open {{Babel}}: {{An}} Open Chemical Toolbox},
title = {Open {{Babel}}: {{An}} Open Chemical Toolbox},
shorttitle = {Open {{Babel}}},
author = {O'Boyle, Noel M. and Banck, Michael and James, Craig A. and Morley, Chris and Vandermeersch, Tim and Hutchison, Geoffrey R.},
year = 2011,
month = oct,
journal = {Journal of Cheminformatics},
volume = {3},
number = {1},
pages = {33},
issn = {1758-2946},
doi = {10.1186/1758-2946-3-33},
urldate = {2025-11-15},
author = {O'Boyle, Noel M. and Banck, Michael and James, Craig A. and Morley, Chris and Vandermeersch, Tim and Hutchison, Geoffrey R.},
year = 2011,
month = oct,
journal = {Journal of Cheminformatics},
volume = {3},
number = {1},
pages = {33},
issn = {1758-2946},
doi = {10.1186/1758-2946-3-33},
urldate = {2025-11-15}
}
28 changes: 14 additions & 14 deletions paper/paper.md
@@ -21,17 +21,17 @@ bibliography: bibliography.bib
# Summary

The increasing prevalence of Machine-Learned Interatomic Potentials (MLIPs) has shifted requirements for setting up atomistic simulations.
Unlike classical force fields, MLIPs primarily require atomic positions and species, thereby removing the need for predefined topology files used for classical force fields in molecular dynamics software like GROMACS[@abrahamGROMACSHighPerformance2015], LAMMPS[@LAMMPS], ESPResSo[@weikESPResSo40Extensible2019], or OpenMM[@eastmanOpenMM8Molecular2024].
Unlike classical force fields, MLIPs primarily require atomic positions and species, thereby removing the need for predefined topology files used for classical force fields in molecular dynamics software like GROMACS [@abrahamGROMACSHighPerformance2015], LAMMPS [@LAMMPS], ESPResSo [@weikESPResSo40Extensible2019], or OpenMM [@eastmanOpenMM8Molecular2024].
Consequently, the Atomic Simulation Environment (ASE) [@larsenAtomicSimulationEnvironment2017] has become a popular Python toolkit for handling atomic structures and interfacing with MLIPs, particularly within the material science and soft matter communities.
ASE originates from the electronic structure community, which shares the same setup as MLIP-driven studies.
In contrast to _ab initio_, MLIPs are much faster and can run on much larger systems, making high-throughput simulations of more complex systems feasible and increasing the need for efficient initial structure generation.

Concurrently, RDKit[@landrumRdkitRdkit2023_03_22023] offers extensive functionality for cheminformatics and manipulating chemical structures.
Concurrently, RDKit [@landrumRdkitRdkit2023_03_22023] offers extensive functionality for cheminformatics and manipulating chemical structures.
However, standard RDKit workflows are not designed for MLIP-driven simulation, while typical ASE-MLIP workflows may lack rich explicit chemical information such as bond orders or molecular identities, as well as capabilities for generating different conformations or searching substructures.

The `molify` package bridges this gap, providing an interface between RDKit's chemical structure generation and cheminformatics capabilities and ASE's handling of 3D atomic structures.
Furthermore, `molify` integrates with PACKMOL[@martinezPACKMOLPackageBuilding2009] to facilitate the creation of complex, periodic simulation cells with diverse chemical compositions, all while preserving crucial chemical connectivity information.
In addition, `molify` simplifies the representation of molecular structures as graphs using NetworkX[@hagbergExploringNetworkStructure2008], e.g., enabling traversing or comparing them.
Furthermore, `molify` integrates with PACKMOL [@martinezPACKMOLPackageBuilding2009] to facilitate the creation of complex, periodic simulation cells with diverse chemical compositions, all while preserving crucial chemical connectivity information.
In addition, `molify` simplifies the representation of molecular structures as graphs using NetworkX [@hagbergExploringNetworkStructure2008], e.g., enabling traversing or comparing them.
Lastly, the combination of these packages enables selection and manipulation of atomistic structures based on chemical knowledge rather than manual index handling.
While designed for MLIP data, the usage of `molify` is not limited and can be expanded, e.g., by utilizing the bond order information in other ASE-based workflows for classical MD simulations or integrating with machine-learning driven bond order predictions.

@@ -41,17 +41,17 @@ While designed for MLIP data, the usage of `molify` is not limited and can be ex
While its core function is to interface these tools, it thereby unlocks new capabilities and significantly reduces the manual coding and data wrangling typically required for preparing and analyzing molecular simulations.
For example, ASE has no tools for handling topological information such as bonds or molecular identities, while RDKit cannot natively interface with MLIPs.

`molify` simplifies workflows that previously involved laborious tasks such as sourcing individual structure files from various databases (e.g., the Materials Project[@jainCommentaryMaterialsProject2013] or the ZINC database[@tingleZINC22AFreeMultiBillionScale2023]) and custom setups of simulation cells.
`molify` simplifies workflows that previously involved laborious tasks such as sourcing individual structure files from various databases (e.g., the Materials Project [@jainCommentaryMaterialsProject2013] or the ZINC database [@tingleZINC22AFreeMultiBillionScale2023]) and custom setups of simulation cells.
With `molify`, more complex and chemically diverse simulation cells are easier to set up and process.

One challenge in MLIP-driven simulations is the post-simulation identification and analysis of molecular fragments or chemical changes, as explicit topological information is not available and changes in connectivity can occur.
`molify` addresses this by enabling the use of RDKit's powerful SMILES[@weiningerSMILESChemicalLanguage1988]/SMARTS-based substructure searching on ASE structures.
In addition, the resulting molecular graph can be exported to a NetworkX[@hagbergExploringNetworkStructure2008] object for further analysis.
This selection and handling allows for similar functionality as is provided by the MDAnalysis[@gowersMDAnalysisPythonPackage2016] atom selection language, designed for simulations with a fixed topology.
`molify` addresses this by enabling the use of RDKit's powerful SMILES [@weiningerSMILESChemicalLanguage1988]/SMARTS-based substructure searching on ASE structures.
In addition, the resulting molecular graph can be exported to a NetworkX [@hagbergExploringNetworkStructure2008] object for further analysis.
This selection and handling allows for similar functionality as is provided by the MDAnalysis [@gowersMDAnalysisPythonPackage2016] atom selection language, designed for simulations with a fixed topology.
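
To make the fragment-identification problem above concrete, the following is a minimal, library-free sketch of how molecules can be recovered from a bond list alone. It is not `molify`'s implementation: the function name `find_fragments` and the `(atom_i, atom_j, bond_order)` tuple layout are assumptions chosen for illustration. Atoms that are linked by any chain of bonds are grouped with a union-find structure.

```python
from collections import defaultdict


def find_fragments(n_atoms, connectivity):
    """Group atom indices into bonded fragments (connected components).

    `connectivity` is a hypothetical list of (atom_i, atom_j, bond_order)
    tuples; the bond order is ignored here, only the bonding matters.
    """
    parent = list(range(n_atoms))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the two components joined by each bond.
    for i, j, _order in connectivity:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Collect atoms by their component root.
    groups = defaultdict(list)
    for atom in range(n_atoms):
        groups[find(atom)].append(atom)
    return sorted(groups.values())


# Two water molecules: atoms 0 and 3 are oxygen, the rest hydrogen.
before = [(0, 1, 1.0), (0, 2, 1.0), (3, 4, 1.0), (3, 5, 1.0)]
# After a proton transfer, hydrogen 5 is bonded to oxygen 0 instead.
after = [(0, 1, 1.0), (0, 2, 1.0), (3, 4, 1.0), (0, 5, 1.0)]

print(find_fragments(6, before))
print(find_fragments(6, after))
```

Comparing the two results shows why a fixed topology is insufficient for reactive MLIP trajectories: the same six atoms form different fragments once the connectivity changes.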

# Features and Implementation

![Visualization of a 3D structure from ASE, visualized with ZnDraw[@elijosiusZeroShotMolecular2024] (left) and its corresponding RDKit 2D chemical structure representation (right).\label{fig:zndraw-rdkit}](zndraw_rdkit.svg)
![Visualization of a 3D structure from ASE, visualized with ZnDraw [@elijosiusZeroShotMolecular2024] (left) and its corresponding RDKit 2D chemical structure representation (right).\label{fig:zndraw-rdkit}](zndraw_rdkit.svg)

The generation of atomic configurations in `molify` is centered around SMILES for defining molecular species.
A typical workflow often follows these steps:
@@ -81,7 +81,7 @@ opt.run(fmax=0.01)
```
All ASE Atoms objects generated or processed by `molify` store `connectivity` information (bonds and their orders) within the `ase.Atoms.info` dictionary.
If this information is available, `molify` uses it to convert between ASE, NetworkX and RDKit.
If an ASE structure is converted to an RDKit molecule without pre-existing connectivity, `molify` leverages RDKit's bond perception algorithms[@kimUniversalStructureConversion2015] to estimate this information.
If an ASE structure is converted to an RDKit molecule without pre-existing connectivity, `molify` leverages RDKit's bond perception algorithms [@kimUniversalStructureConversion2015] to estimate this information.

A visualisation of the 2D and 3D structure from the simulation is shown in \autoref{fig:zndraw-rdkit}.

@@ -116,9 +116,6 @@ frames: list[ase.Atoms] = get_substructures(
)
```
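
As a rough illustration of what the stored bond orders make possible, the sketch below sums the bond orders at each atom, which is a simple way to sanity-check valences. A plain dictionary stands in for `ase.Atoms.info`. The `connectivity` key is taken from the paper text, while the `(atom_i, atom_j, bond_order)` tuple layout and the helper name `bond_order_sums` are assumptions, not the documented `molify` format.

```python
def bond_order_sums(n_atoms, connectivity):
    """Return the total bond order at each atom.

    `connectivity` is assumed to be a list of
    (atom_i, atom_j, bond_order) tuples.
    """
    totals = [0.0] * n_atoms
    for i, j, order in connectivity:
        # Every bond contributes its order to both of its atoms.
        totals[i] += order
        totals[j] += order
    return totals


# Stand-in for the info dictionary of a CO2 molecule:
# atom 0 is carbon, atoms 1 and 2 are oxygen, both bonds are double bonds.
info = {"connectivity": [(0, 1, 2.0), (0, 2, 2.0)]}

totals = bond_order_sums(3, info["connectivity"])
print(totals)
```

Here the carbon atom carries a total bond order of four and each oxygen a total of two, matching their usual valences.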

# Acknowledgements
F.Z. acknowledges support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) in the framework of the priority program SPP 2363, "Utilization and Development of Machine Learning for Molecular Applications – Molecular Machine Learning" Project No. 497249646. Further funding through the DFG under Germany's Excellence Strategy – EXC 2075 – 390740016 and the Stuttgart Center for Simulation Science (SimTech) was provided.

# Related software
The functionality of `molify` relies critically on the following packages:

@@ -133,7 +130,10 @@ The `molify` package is currently a crucial part of the following software packa
- [ZnDraw](https://github.com/zincware/zndraw): Interactive generation of simulation boxes and selection of substructures through a graphical user interface inside a web-based visualization package.
- [mlipx](https://github.com/basf/mlipx): Creating initial structures for benchmarking different MLIPs on real-world test scenarios.

The OpenBabel[@oboyleOpenBabelOpen2011] package provides similar cheminformatics functionality to RDKit along with extensive file format support.
The OpenBabel [@oboyleOpenBabelOpen2011] package provides similar cheminformatics functionality to RDKit along with extensive file format support.
However, OpenBabel is primarily designed as a format conversion tool with a focus on command-line usage and file I/O, while `molify` is designed for Python-native workflows with in-memory object conversions.
Currently, OpenBabel does not provide direct support for ASE Atoms objects, ASE calculators (including MLIPs), or seamless RDKit-ASE interconversion within Python.
Furthermore, `molify`'s integration with PACKMOL and NetworkX provides capabilities beyond OpenBabel's core focus on format conversion.

# Acknowledgements
F.Z. acknowledges support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) in the framework of the priority program SPP 2363, "Utilization and Development of Machine Learning for Molecular Applications – Molecular Machine Learning" Project No. 497249646. Further funding through the DFG under Germany's Excellence Strategy – EXC 2075 – 390740016 and the Stuttgart Center for Simulation Science (SimTech) was provided.